Redis-backed caching for data that changes infrequently but is requested constantly: user profile data, permission structures, product catalogue, and configuration values. Every key carries an explicit TTL, and writes invalidate the cache directly: the affected key is deleted or overwritten as soon as the database transaction commits, rather than relying on time-based expiry that keeps serving stale data after an update. Redis MULTI/EXEC is used where several related keys must change together; it makes the Redis commands atomic but does not span the database, so the TTL remains the safety net if an invalidation is missed. Cache-aside pattern with stale-while-revalidate for high-traffic endpoints: serve the cached value immediately while triggering a single asynchronous background refresh, avoiding the thundering herd problem where expiry of a popular key causes a spike of simultaneous database reads.
BullMQ (Redis-backed) for background job queues: email sending, PDF generation, image processing, third-party webhook delivery, and any operation that doesn't need to complete synchronously within the HTTP request lifecycle. Queue configuration with worker concurrency limits (so a burst of queued jobs cannot overwhelm downstream services), retry policies with exponential backoff, and job failure alerting integrated with PagerDuty or Slack so failed background work is visible to on-call.
Horizontal scaling preparation: stateless application servers (sessions validated against Redis or carried as signed JWTs, no in-memory state) and database connection pooling via PgBouncer, with each instance's pool sized to the PostgreSQL max_connections limit, less a reserve for administrative connections, divided by the number of application server instances. pg_stat_statements extension enabled to identify the queries accounting for the highest cumulative execution time: these are the optimisation targets worth addressing before vertical scaling or adding read replicas. Target: p50 API response under 100ms, p95 under 400ms for typical CRUD endpoints under production load.
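
The cache-aside approach with stale-while-revalidate described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not a production implementation: `CacheStore`, `MemoryStore`, `SwrCache`, `freshMs` and the loader callback are all hypothetical names, and an in-memory map stands in for Redis so the sketch runs on its own.

```typescript
// Minimal async key-value interface. In production this would wrap a Redis
// client; every name in this sketch is hypothetical.
interface CacheStore {
  get(key: string): Promise<string | null>;
  set(key: string, value: string): Promise<void>;
  del(key: string): Promise<void>;
}

// In-memory stand-in so the sketch runs without Redis.
class MemoryStore implements CacheStore {
  private data = new Map<string, string>();
  async get(key: string): Promise<string | null> {
    return this.data.get(key) ?? null;
  }
  async set(key: string, value: string): Promise<void> {
    this.data.set(key, value);
  }
  async del(key: string): Promise<void> {
    this.data.delete(key);
  }
}

interface Entry<T> {
  value: T; // must be JSON-serialisable
  freshUntil: number; // epoch ms after which the value counts as stale
}

class SwrCache<T> {
  private store: CacheStore;
  private load: (key: string) => Promise<T>;
  private freshMs: number;
  private now: () => number;
  // In-flight loads per key, so concurrent callers share one database read.
  private inFlight = new Map<string, Promise<T>>();

  constructor(
    store: CacheStore,
    load: (key: string) => Promise<T>,
    freshMs: number,
    now: () => number = () => Date.now(),
  ) {
    this.store = store;
    this.load = load;
    this.freshMs = freshMs;
    this.now = now;
  }

  async get(key: string): Promise<T> {
    const raw = await this.store.get(key);
    if (raw === null) {
      // Miss: nothing to serve, so the caller waits for the load.
      return this.refresh(key);
    }
    const entry = JSON.parse(raw) as Entry<T>;
    if (this.now() >= entry.freshUntil) {
      // Stale: serve the old value now and refresh in the background.
      // A failed refresh is swallowed so the stale value keeps being served.
      this.refresh(key).catch(() => undefined);
    }
    return entry.value;
  }

  // Call after the database write commits so the next read reloads.
  async invalidate(key: string): Promise<void> {
    await this.store.del(key);
  }

  private refresh(key: string): Promise<T> {
    const pending = this.inFlight.get(key);
    if (pending) return pending;
    const p = Promise.resolve()
      .then(() => this.load(key))
      .then(async (value) => {
        const entry: Entry<T> = { value, freshUntil: this.now() + this.freshMs };
        await this.store.set(key, JSON.stringify(entry));
        return value;
      })
      .finally(() => {
        this.inFlight.delete(key);
      });
    this.inFlight.set(key, p);
    return p;
  }
}
```

In this sketch the in-flight map only deduplicates loads within one process. With several application instances, each instance could still trigger its own refresh, so a real deployment would probably pair this with a hard Redis TTL longer than `freshMs` and, if needed, a short-lived lock key.
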
Background job observability: BullMQ Dashboard or Taskforce.sh for real-time visibility into queue depth, job processing rate, failed job count, and job latency by queue. Failed jobs logged with the full job data and stack trace, queryable without connecting directly to Redis. Bull Board integration within the admin panel for operations teams who need to inspect or retry failed jobs without engineer involvement. Worker memory leak detection: Node.js --inspect flag with periodic heap snapshot comparison in staging to catch memory growth patterns before they cause worker OOMKills in production.
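
As one way to make failed jobs queryable outside Redis, the failure can be turned into a structured log record and shipped to whatever log store is already in use. The sketch below is an assumption-laden illustration: `JobLike`, `FailedJobRecord` and `toFailedJobRecord` are hypothetical names, and `JobLike` is a locally declared subset of the fields a BullMQ job exposes, so the block has no dependencies.

```typescript
// Locally declared subset of job fields (hypothetical type, modelled on the
// id / name / queueName / data / attemptsMade fields of a BullMQ job).
interface JobLike {
  id?: string;
  name: string;
  queueName: string;
  data: unknown;
  attemptsMade: number;
}

// Shape of the log line written for every failed job.
interface FailedJobRecord {
  level: "error";
  event: "job_failed";
  queue: string;
  jobId: string | null;
  jobName: string;
  attemptsMade: number;
  data: unknown; // full job payload, as described in the text
  errorMessage: string;
  stack: string | null;
  failedAt: string; // ISO 8601 timestamp
}

// Builds a JSON-serialisable record from a failed job and its error.
function toFailedJobRecord(
  job: JobLike,
  err: Error,
  now: Date = new Date(),
): FailedJobRecord {
  return {
    level: "error",
    event: "job_failed",
    queue: job.queueName,
    jobId: job.id ?? null,
    jobName: job.name,
    attemptsMade: job.attemptsMade,
    data: job.data,
    errorMessage: err.message,
    stack: err.stack ?? null,
    failedAt: now.toISOString(),
  };
}

// Example: serialise one record as a single log line.
const exampleLine = JSON.stringify(
  toFailedJobRecord(
    { id: "42", name: "send-email", queueName: "email", data: { to: "user@example.com" }, attemptsMade: 3 },
    new Error("SMTP timeout"),
  ),
);
```

With BullMQ, a function like this would typically be called from a handler on the worker's `failed` event, with the resulting line passed to the application logger; that wiring is left out here so the sketch stays runnable without a Redis connection.
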
Event-driven architecture patterns implemented where appropriate: domain events published to a message bus (AWS EventBridge, Redis pub/sub, or RabbitMQ) so that cross-domain actions (user registered → welcome email, order placed → inventory decrement, payment succeeded → invoice generated) happen without tight coupling between services. Saga pattern for multi-step distributed transactions that require compensating actions if a step fails (order creation saga: reserve inventory → charge payment → create fulfilment record; if payment fails, the inventory reservation is released via a compensating action). Outbox pattern for reliable event delivery: the domain event is written to an outbox table in the same database transaction as the business record, then a background process reads the outbox and publishes to the message bus. Events are therefore not lost even if the message bus is temporarily unavailable at the time of the write; delivery is at-least-once, so consumers are written to handle duplicates idempotently.
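
The order creation saga described above can be sketched as a list of steps, each with an optional compensating action that runs in reverse order when a later step fails. This is a minimal in-memory illustration, not a definitive implementation: `SagaStep`, `runSaga`, `OrderContext` and `orderCreationSaga` are hypothetical names, and the steps mutate a plain object where a real system would call the inventory, payment and fulfilment services.

```typescript
interface SagaStep<C> {
  name: string;
  action: (ctx: C) => Promise<void>;
  compensate?: (ctx: C) => Promise<void>; // undoes a completed action
}

type SagaResult =
  | { ok: true }
  | { ok: false; failedStep: string; error: unknown; compensated: string[] };

// Runs steps in order. On the first failure, compensates the steps that
// already completed, most recent first, then reports what happened.
async function runSaga<C>(steps: SagaStep<C>[], ctx: C): Promise<SagaResult> {
  const completed: SagaStep<C>[] = [];
  for (const step of steps) {
    try {
      await step.action(ctx);
      completed.push(step);
    } catch (error) {
      const compensated: string[] = [];
      for (const done of [...completed].reverse()) {
        if (done.compensate) {
          // Assumption: compensations succeed. A real system would retry
          // them or flag the saga for manual intervention.
          await done.compensate(ctx);
          compensated.push(done.name);
        }
      }
      return { ok: false, failedStep: step.name, error, compensated };
    }
  }
  return { ok: true };
}

// Hypothetical order state standing in for calls to real services.
interface OrderContext {
  orderId: string;
  paymentShouldFail: boolean; // simulates a declined payment
  inventoryReserved: boolean;
  paymentCharged: boolean;
  fulfilmentCreated: boolean;
}

function orderCreationSaga(): SagaStep<OrderContext>[] {
  return [
    {
      name: "reserve inventory",
      action: async (ctx) => { ctx.inventoryReserved = true; },
      compensate: async (ctx) => { ctx.inventoryReserved = false; },
    },
    {
      name: "charge payment",
      action: async (ctx) => {
        if (ctx.paymentShouldFail) throw new Error("payment declined");
        ctx.paymentCharged = true;
      },
      compensate: async (ctx) => { ctx.paymentCharged = false; },
    },
    {
      name: "create fulfilment record",
      action: async (ctx) => { ctx.fulfilmentCreated = true; },
    },
  ];
}
```

Calling `runSaga(orderCreationSaga(), ctx)` with `paymentShouldFail` set resolves to a failure result naming the `charge payment` step, with the inventory reservation released. Keeping the compensation next to its action makes it harder to add a step without deciding how to undo it.
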