Redis and BullMQ Queue Patterns
Comprehensive guide to Redis-backed job queues with BullMQ in Node.js/NestJS
This post covers how Redis and BullMQ bring reliability guarantees to background jobs in Node.js (API calls, sync operations, notifications), from data structures and threading models to error classification and dead letter queues.
The Problem
Background jobs in Node.js applications (API calls, sync operations, notifications) need reliability guarantees that in-process execution cannot provide. When a server crashes mid-job, the work is lost. When an external API rate-limits you, there is no built-in retry. When traffic spikes, the system has no way to absorb and smooth out load.
Difficulties Encountered
- Misconception that BullMQ uses separate threads — Initial assumption was that BullMQ workers run on a separate thread; in reality they share the Node.js event loop, which changes how you reason about concurrency and blocking
- Understanding Redis’s role beyond caching — Redis is commonly taught as “a cache,” so recognizing it as a coordination system (atomic operations, pub/sub, sorted sets for delayed jobs) required a mental model shift
- Race conditions with concurrent updates — Without queue-based sequencing, rapid user actions (update then delete) caused out-of-order processing that was hard to reproduce and debug
- Choosing between EventEmitter, Promise chains, and BullMQ — Each pattern looks similar on the surface; the key differentiator (persistence + retry + monitoring) only becomes obvious after a production incident
Why Not “Run and Forget”
The naive approach has critical production issues:
// BAD: Run and forget
async function handleBlockUpdate(block) {
try {
await googleCalendarAPI.update(block);
} catch (error) {
console.error(error); // Lost forever
}
}

Problems:
- Lost on server crash
- No retry mechanism
- No rate limiting
- No deduplication
- No monitoring/observability
- Can’t handle burst traffic
With Redis + BullMQ:
// GOOD: Production-grade
await queue.add(
"update-calendar",
{
blockId: block.id,
snapshot: extractSnapshot(block),
intent: "update"
},
{
jobId: `block-${block.id}-update`, // Deduplication
attempts: 3, // Auto-retry
backoff: { type: "exponential" }, // Smart delays
priority: urgent ? 1 : 10 // Priority queue
}
);

Redis: The Coordination System
What Redis Actually Is
- In-memory data structure store (not just a cache)
- Persistent (survives restarts)
- Single-threaded for commands = guaranteed atomicity
- Handles 100,000+ ops/second on single core
Key Problems Redis Solves
1. Persistence & Reliability
// Job stored in Redis:
{
id: "job-123",
data: { blockId: 456, intent: "update" },
attempts: 1,
status: "active"
}
// Survives app crashes, will be retried on restart

2. Distributed Coordination
// Atomic operation (BRPOPLPUSH):
// 1. Remove job from waiting list
// 2. Add to processing list
// 3. Return to exactly ONE worker
// All in single atomic operation - no race conditions

3. Rate Limiting
new Worker("calendar-queue", processor, {
limiter: {
max: 100, // Max 100 jobs
duration: 60000 // Per minute
}
});

4. Retry Logic
{
attempts: 3,
backoff: {
type: 'exponential',
delay: 2000 // 2s, 4s, 8s...
}
}

BullMQ Data Structures
BullMQ uses Redis data structures for different purposes:
| Structure | Use | Example |
|---|---|---|
| Lists | FIFO queue | waiting: [job3, job2, job1] |
| Sorted Sets | Delayed jobs | {score: timestamp, member: "job-123"} |
| Hashes | Job data | job:123 {data: "...", opts: "..."} |
| Sets | Deduplication | completed: {job-123, job-124} |
Lists (FIFO Queue)
waiting jobs: [job3, job2, job1] // job1 processed first
BRPOPLPUSH removes from right, ensures FIFO order

Redis BRPOPLPUSH (blocking right-pop, left-push) atomically moves a job from the waiting list to the active list and returns it to exactly one worker. This atomic operation prevents duplicate processing across multiple workers.
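The semantics can be sketched in plain TypeScript (a toy in-memory model for illustration only; real BullMQ performs this move atomically inside Redis, across processes):

```typescript
// Toy model of the waiting -> active move. Newest jobs sit at the left
// (index 0), mirroring LPUSH; the oldest job is taken from the right.
const waiting: string[] = ["job3", "job2", "job1"]; // job1 was enqueued first
const active: string[] = [];

function claimNextJob(): string | undefined {
  // Pop from the right so the oldest job is taken first (FIFO),
  // and record it as active in the same step
  const job = waiting.pop();
  if (job !== undefined) active.unshift(job);
  return job;
}
```

In real Redis the pop-and-push happens server-side in one command, which is what prevents two workers from claiming the same job.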
Sorted Sets (Delayed/Scheduled Jobs)
delayed jobs: {
score: 1703001234567 (timestamp),
member: "job-123"
}
// Jobs become available when current time > score

BullMQ polls the sorted set and moves jobs whose score (scheduled timestamp) is less than or equal to the current time back into the waiting list.
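The promotion logic looks roughly like this (a simplified in-memory sketch; BullMQ does the range query inside Redis with sorted-set commands):

```typescript
// Simplified model of the delayed set: jobs keyed by their scheduled timestamp
interface DelayedJob {
  member: string; // job id
  score: number;  // timestamp (ms) when the job becomes runnable
}

// Return jobs that are due (score <= now) in schedule order,
// the way BullMQ moves them back into the waiting list
function promoteDueJobs(delayed: DelayedJob[], now: number): string[] {
  return delayed
    .filter((job) => job.score <= now)
    .sort((a, b) => a.score - b.score)
    .map((job) => job.member);
}
```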
Hashes (Job Data Storage)
job:123 {
data: "{ blockId: 456, snapshot: {...} }",
opts: "{ attempts: 3, delay: 5000 }",
timestamp: "1703001234567"
}

Each job’s full payload, options, and metadata are stored as a Redis hash, allowing partial reads of individual fields without deserializing the entire job.
Sets (Job Deduplication)
completed jobs: {job-123, job-124, job-125}
// Check if job already processed via SISMEMBER

Combined with the jobId option, BullMQ uses sets to track completed and failed jobs, preventing re-processing of duplicate submissions.
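The deduplication check can be sketched as follows (a toy model; BullMQ performs the membership check in Redis when you pass a jobId):

```typescript
// Toy model of jobId-based deduplication
const seenJobIds = new Set<string>();

// Returns true if the job was enqueued, false if it was a duplicate
function addJobOnce(jobId: string, enqueue: (id: string) => void): boolean {
  if (seenJobIds.has(jobId)) return false; // already queued or processed: skip
  seenJobIds.add(jobId);
  enqueue(jobId);
  return true;
}
```

This is why the earlier example used a deterministic jobId like `block-${block.id}-update`: two rapid submissions for the same block collapse into one job.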
Threading Model
Common Misconception: BullMQ Uses Separate Threads
Q: Does BullMQ use a separate thread from the main NestJS service?
A: No. BullMQ runs in the same Node.js process and same event loop. Node.js is single-threaded. The BullMQ worker, NestJS controllers, services, and database queries all share one event loop. This means a CPU-intensive job in a BullMQ worker blocks the entire application, including HTTP request handling.
Main Thread (Event Loop)
├── NestJS Controllers
├── Services
├── Database queries
└── BullMQ Workers ← SAME THREAD

BullMQ can run in a separate process (not thread) by creating a standalone worker file, but the default NestJS integration (@Processor decorator) runs in-process:
// OPTION 1: Same process (default NestJS integration)
@Module({
imports: [
BullModule.registerQueue({ name: "google-calendar-event" }),
],
providers: [QueueProcessor], // Worker in same process
})
// OPTION 2: Separate process (standalone worker file)
// worker.ts - run as a separate process (e.g., with ts-node, or compile and run with node)
import { Worker } from "bullmq";
const worker = new Worker("google-calendar-event", async (job) => {
// Runs in completely separate process
});

How It’s Still Non-Blocking
// Sync workflow - returns immediately
await this.queue.add("create-channel", data);
// ↑ Job added to Redis, continues immediately
return { success: true };
// ↑ Returns to user immediately
// Later in event loop:
// Worker processes the job asynchronously

BullMQ Concurrency
@Processor("google-calendar-event", {
concurrency: 5 // Process up to 5 jobs simultaneously
})
export class QueueProcessor extends WorkerHost {
async process(job: Job): Promise<void> {
// Each job runs independently
}
}

Non-Blocking Solutions Comparison
| Solution | Non-blocking | Persistence | Retry | Monitoring | Complexity | Use Case |
|---|---|---|---|---|---|---|
| EventEmitter | Yes | No | No | No | Simple | Fire-and-forget, in-memory |
| Promise Chain | Yes | No | Manual | No | Simple | Simple async operations |
| Worker Threads | Yes | No | Manual | No | Complex | CPU-intensive tasks |
| BullMQ | Yes | Yes | Yes | Yes | Moderate | Critical jobs with retry |
| Microservices | Yes | Yes | Yes | Yes | Very complex | Large-scale distributed |
Choosing the Right Solution
- Simple fire-and-forget — EventEmitter or Promise chain
- Critical operations — BullMQ (persistence + retry + monitoring)
- CPU-intensive work — Worker Threads (true parallelism, separate CPU thread)
- Large-scale distributed — Microservices (separate services, independent scaling, but massive infrastructure overhead)
EventEmitter Pattern
// Publisher
this.eventEmitter.emit('channel.create', data);
// Handler
@OnEvent('channel.create')
async handleChannelCreate(data): Promise<void> {
await this.queue.add('create-channel', data);
}

Note: EventEmitter does nothing by itself - it only calls registered handlers. If no handler exists, the event is silently lost.
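You can observe the silent-loss behavior directly with Node's built-in EventEmitter: emit() returns false when no listener received the event.

```typescript
import { EventEmitter } from "node:events";

const emitter = new EventEmitter();

// No handler registered yet: the event is dropped on the floor
const deliveredBefore = emitter.emit("channel.create", { id: 1 }); // false

emitter.on("channel.create", (_data) => {
  // In the real app this handler would enqueue the BullMQ job
});

const deliveredAfter = emitter.emit("channel.create", { id: 1 }); // true
```

Nothing warns you about the first emit; the event simply never happens, which is exactly why EventEmitter alone is unsuitable for critical work.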
Promise Chain Pattern
this.service
.doSomething()
.then(() => console.log("done"))
.catch((err) => console.error(err));
// Returns immediately, runs async

Use for: Simple, non-critical operations only.
When to Use BullMQ
Use BullMQ when you need:
- Persistence - Jobs must survive crashes
- Retry Logic - External APIs can fail temporarily
- Monitoring - Need to track job status and failures
- Rate Limiting - Prevent API throttling
- Deduplication - Prevent duplicate processing
- Priority Queue - Some jobs are more urgent
Example: Calendar Sync
// Channel creation is CRITICAL
// - If lost, calendar won't receive updates for 24+ hours
// - Google API can fail (rate limits, timeouts)
// - Need visibility into failures
await this.queue.add("create-channel", data, {
attempts: 3,
backoff: { type: "exponential", delay: 1000 }
});

When NOT to Use
- Simple fire-and-forget events — If job loss is acceptable and no retry is needed, an EventEmitter or plain Promise is simpler and has zero infrastructure overhead
- CPU-bound computation — BullMQ shares the Node.js event loop; CPU-intensive work blocks other jobs. Use Worker Threads or a separate process instead
- Sub-millisecond latency requirements — The Redis round-trip adds latency (~1-5ms); for real-time hot paths where every millisecond matters, direct function calls are faster
- Single-use scripts or CLI tools — Adding Redis as a dependency for a one-off script is unnecessary complexity
- When you need exactly-once delivery — BullMQ provides at-least-once semantics; if duplicate processing causes data corruption, you need idempotency guards on top of BullMQ
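Since BullMQ is at-least-once, the usual defense is an idempotency guard inside the processor. A minimal sketch (the names and in-memory Set are illustrative; in production the processed-set would live in Redis or your database):

```typescript
// Minimal idempotency guard: remember which jobs already applied their
// side effect, and skip re-processing on redelivery (illustrative names)
const processedJobs = new Set<string>();

async function processIdempotently(
  jobId: string,
  sideEffect: () => Promise<void>
): Promise<"applied" | "skipped"> {
  if (processedJobs.has(jobId)) return "skipped"; // duplicate delivery
  await sideEffect();
  processedJobs.add(jobId); // mark done only after the effect succeeded
  return "applied";
}
```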
Options Considered
| Option | Pros | Cons |
|---|---|---|
| BullMQ + Redis | Persistence, retry, rate limiting, monitoring, deduplication | Requires Redis infrastructure; at-least-once only |
| EventEmitter | Zero deps; in-process; simple | No persistence; lost on crash; no retry |
| Promise Chain | Native JS; no deps | No persistence; manual retry; no monitoring |
| Worker Threads | True parallelism for CPU work | No persistence; manual retry; complex IPC |
| AWS SQS / Microservices | Managed; scales independently; cross-service | Higher latency; more infrastructure; overkill for single-service |
Why This Approach
Chose BullMQ + Redis because the calendar sync use case requires all three: persistence (jobs must survive crashes), automatic retry (Google API rate limits), and observability (tracking failed syncs). EventEmitter and Promise chains lack persistence. Worker Threads solve a different problem (CPU parallelism). SQS was overkill for a single NestJS service that already uses Redis for other purposes.
Race Condition Prevention
Real-World Scenario: UPDATE Then DELETE
A user edits a calendar block and then immediately deletes it. Without a queue, both operations fire as independent async calls to the Google Calendar API:
// Without queue: race condition
updateBlock(id); // Takes 2 seconds (Google API round-trip)
deleteBlock(id); // Takes 1 second
// DELETE completes first!
// UPDATE then either fails (404) or RECREATES the deleted item
// Result: ghost calendar event that the user thought was deleted

This is hard to reproduce because it depends on network timing. It only surfaces when a user acts fast enough for the requests to overlap.
With Redis + BullMQ, jobs for the same resource are processed sequentially:
// With queue: ordered processing
await queue.add("update", { blockId: 123 });
await queue.add("delete", { blockId: 123 });
// Redis ensures sequential processing for blockId: 123
// UPDATE completes first, then DELETE runs -- correct order guaranteed

Distributed Locking
private async acquireLock(blockId: number, ttl: number = 30): Promise<boolean> {
const lockKey = `block-lock:${blockId}`;
const lockValue = randomUUID(); // from node:crypto - unique token so only the owner can release
const redisClient = await this.queue.client;
// NX = set only if the key does not exist; EX = expire after ttl seconds
const result = await redisClient.set(lockKey, lockValue, 'EX', ttl, 'NX');
return result === 'OK';
}

Monitoring and Observability
// See all failed jobs
await queue.getFailed();
// See waiting jobs
await queue.getWaiting();
// Get metrics
const counts = await queue.getJobCounts();
// { waiting: 5, active: 2, completed: 100, failed: 3 }
// Get specific job status
const job = await queue.getJob(jobId);
console.log(job.failedReason);

Architecture Summary
Benefits:
- Reliability: 99.9%+ job completion rate
- Scalability: Add workers to handle more load
- Observability: Track every job’s status
- Maintainability: Clear separation of concerns
- Performance: Absorb traffic spikes, smooth out load
Error Handling and DLQ Patterns
Retry logic and monitoring are only half the story. What happens when all retries are exhausted? Without a deliberate failure path, jobs vanish silently — and you never know they failed.
Dead Letter Queue (DLQ)
When all retries are exhausted, failed jobs need a recovery path. A common
mistake is setting removeOnFail: true to keep the failed list clean. The
problem: failed jobs disappear with no audit trail. You cannot inspect them,
replay them, or even count them.
// BAD: Silent failure
await queue.add("sync", data, {
attempts: 20,
backoff: { type: "fixed", delay: 100 },
removeOnFail: true, // Job vanishes after 20 retries — no trace left
});
// GOOD: DLQ captures failures for investigation
await queue.add("sync", data, {
attempts: 5,
backoff: { type: "exponential", delay: 2000 },
removeOnFail: false, // Job stays in failed state for inspection
});

With removeOnFail: false, failed jobs remain queryable via queue.getFailed(). You can build alerting on top of this (e.g., Slack notification when failed count exceeds a threshold) and replay jobs after fixing the underlying issue.
Retryable vs Non-Retryable Errors
Not all errors deserve retries. A 400 Bad Request will fail the same way on
every attempt — retrying it wastes your retry budget and delays other jobs. A 503 Service Unavailable or 429 Too Many Requests is transient and worth
retrying. Distinguish permanent client errors from transient server errors:
function isRetryableError(error: any): boolean {
const status = error?.response?.status;
const nonRetryable = [400, 401, 403, 404];
return !nonRetryable.includes(status);
// 500, 503, 429 → retry (server issue or rate limit)
// 400, 401, 403, 404 → don't retry (client error, won't change)
}

Use this classification in your worker to short-circuit retries on permanent failures. Move non-retryable jobs directly to the DLQ with full error context instead of burning through the retry budget.
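As a quick sanity check of the classification, here is a runnable version with a few spot checks. Note that errors without a response status (raw network failures) classify as retryable, which is usually the intent:

```typescript
// Runnable version of the classification above, with spot checks
function isRetryableError(error: any): boolean {
  const status = error?.response?.status;
  const nonRetryable = [400, 401, 403, 404];
  return !nonRetryable.includes(status);
}

const rateLimited = isRetryableError({ response: { status: 429 } }); // true: transient
const notFound = isRetryableError({ response: { status: 404 } });    // false: permanent
const noStatus = isRetryableError(new Error("socket hang up"));      // true: network error
```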
BullMQ UnrecoverableError
The isRetryableError classification above works when you control the retry logic yourself. But what about the retries BullMQ manages automatically? If your retry config is designed for one failure mode (e.g., lock contention with attempts: 20, delay: 100ms) but your processor also throws permanent errors (e.g., business logic failures), BullMQ retries everything — including errors that will never succeed.
BullMQ’s UnrecoverableError solves this. When thrown from a processor, it tells BullMQ to skip all remaining retry attempts and move the job directly to the failed state:
import { UnrecoverableError } from 'bullmq';
async process(job: Job): Promise<void> {
try {
await this.executeJob(job);
} catch (error) {
// Business logic errors (ArchException hierarchy) are permanent
if (error instanceof ArchException) {
throw new UnrecoverableError(error.message);
}
// Lock contention, network errors → let BullMQ retry
throw error;
}
}

The key pattern here is the exception hierarchy boundary. All business logic exceptions extend a common base class (ArchException), while infrastructure errors are plain Error. This makes instanceof ArchException a reliable separator between retryable and non-retryable errors without maintaining an explicit error-type list. When you add a new business exception (e.g., ResourceNotFoundException), it automatically inherits ArchException and gets the UnrecoverableError treatment.
I learned this the hard way. A stale gcalId caused updateEvent() to throw ServerLogicException (an ArchException subclass) 20 times in 10 seconds — the retry config (attempts: 20, delay: 100ms) was designed for Redis lock contention, but it also retried permanent business logic failures. Adding UnrecoverableError eliminated the retry storm entirely.
Layered Retry Strategy
A single retry mechanism is brittle. Network blips resolve in milliseconds; API outages last minutes. Combine three layers, each handling a different failure duration:
Layer 1: In-process retry (exponential, 3 attempts, ~15s total)
↓ still failing
Layer 2: Queue-level retry (exponential, 5 attempts, ~10min total)
↓ still failing
Layer 3: DLQ (manual inspection + alerting)

Layer 1 catches transient network blips — a dropped connection, a brief DNS hiccup. These resolve in seconds. Layer 2 handles longer outages — an external API down for maintenance, a rate limit window. Layer 3 is for failures that need human investigation — a revoked API key, a schema change in the external service, or a bug in your serialization logic.
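Layer 1 can be a small in-process helper. A sketch, assuming exponential backoff starting at the base delay (the withRetry name is illustrative):

```typescript
// Layer 1: in-process retry with exponential backoff (illustrative helper)
async function withRetry<T>(
  fn: () => Promise<T>,
  attempts = 3,
  baseDelayMs = 1000
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < attempts; attempt++) {
    try {
      return await fn();
    } catch (error) {
      lastError = error;
      if (attempt < attempts - 1) {
        // baseDelayMs, 2x, 4x, ... between attempts
        await new Promise((resolve) =>
          setTimeout(resolve, baseDelayMs * 2 ** attempt)
        );
      }
    }
  }
  throw lastError; // exhausted: let Layer 2 (queue-level retry) take over
}
```

The final throw is what hands off to Layer 2: the job's processor fails, and BullMQ's own attempts/backoff configuration takes over.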
Rate Limit Awareness (429)
When an API returns 429 Too Many Requests, it often includes a Retry-After header telling you exactly how long to wait. Ignoring this header and using your
own fixed backoff is wasteful — you either wait too long or not long enough.
BullMQ’s job.moveToDelayed(), combined with DelayedError, lets you reschedule the job with the exact delay the API
requested:
// Inside the processor, which receives (job, token)
if (error.response?.status === 429) {
const retryAfter = parseInt(
error.response.headers["retry-after"] ?? "60",
10
);
// Reschedule for exactly when the API says capacity returns,
// then throw DelayedError to signal the job was moved, not failed
await job.moveToDelayed(Date.now() + retryAfter * 1000, token);
throw new DelayedError();
}

This is different from a normal retry: DelayedError does not decrement the
attempt counter. The job moves to the delayed set and re-enters the waiting list
after the specified duration, preserving your retry budget for genuine failures.
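Note that Retry-After can be either delta-seconds or an HTTP-date, so parsing deserves a little care. A small helper (a sketch; the retryAfterToMs name is illustrative):

```typescript
// Parse a Retry-After header into a delay in milliseconds.
// The header is either delta-seconds ("120") or an HTTP-date.
function retryAfterToMs(header: string | undefined, fallbackSec = 60): number {
  if (!header) return fallbackSec * 1000;
  const seconds = Number(header);
  if (!Number.isNaN(seconds)) return Math.max(0, seconds * 1000);
  const atTime = Date.parse(header); // HTTP-date form
  if (Number.isNaN(atTime)) return fallbackSec * 1000;
  return Math.max(0, atTime - Date.now());
}
```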
Key Points
- Redis is not just a cache - It’s a coordination system
- Single-threaded Redis = Atomic operations = No race conditions
- Multi-process workers = Parallel processing with safety
- Persistence = Jobs survive crashes and restarts
- Built-in patterns = Retry, rate limit, deduplication, priority
- Observability = Know what’s happening in your system
- DLQ is essential = removeOnFail: true silently loses failures
- Classify errors before retrying = 4xx errors waste retry budget
- Use UnrecoverableError for permanent failures = When thrown from a processor, BullMQ skips all remaining retry attempts. Use instanceof on your exception hierarchy to cleanly separate retryable (lock contention, network) from non-retryable (auth, not-found, business logic) errors
Further Study Topics
- Redis Streams — Alternative to Lists for queue workloads; supports consumer groups, message acknowledgment, and replay from any point in the stream
- Redis Pub/Sub — Real-time event broadcasting to multiple subscribers; no persistence (messages lost if no subscriber is listening)
- Redis Cluster — Horizontal scaling by sharding data across multiple nodes; needed when a single Redis instance cannot handle the throughput or memory requirements
- BullMQ Pro features — Job groups (rate limit per group), nested queues, and advanced flow control
- Alternative queue systems — RabbitMQ (AMQP protocol, complex routing), Kafka (event streaming at scale), AWS SQS (managed, no infrastructure)