Alerts Resilience & Delivery Tracking: Developer Guide
Overview
The Alerts service is the platform's notification engine: it dispatches emails, SMS, and push notifications triggered by events from other subgraphs (Murchases, Services, Auth). The Resilience layer ensures no notification is lost during provider outages by introducing a Dead Letter Queue (DLQ) backed by MongoDB and a persistent AlertLog that tracks every delivery attempt.
| Component | Purpose | Backend |
|---|---|---|
| AlertLog Entity | Persistent record of every notification attempt | MongoDB (alert_logs collection) |
| DlqService | Dead Letter Queue with exponential back-off retry | MongoDB + @nestjs/schedule cron |
| AlertDeliveryService | Single orchestration layer for all outbound sends | NestJS injectable |
| AlertLogResolver | Admin-only GraphQL queries for delivery inspection | Yoga Federation subgraph |
Architecture
Every notification in the platform, whether it's a Murchase confirmation email, an SMS to a Malet Owner, or a push notification about a Workroom action, flows through the AlertDeliveryService. This guarantees that every send is logged and retryable.
graph TD
A["OrderAlertsController"] --> D["AlertDeliveryService"]
B["GmailAlertsController"] --> D
C["SmsAlertsController"] --> D
E["WorkroomAlertsController"] --> D
D --> F{"Channel Router"}
F -->|EMAIL| G["GmailAlertsService (Resend)"]
F -->|SMS| H["SmsAlertsService (Twilio)"]
F -->|PUSH| I["PushAlertsService"]
F -->|IN_APP| J["AlertLog (IN_APP)"]
D --> K["DlqService"]
K --> L[("MongoDB: alert_logs")]
M["Cron: every 5 min"] --> D
D -->|"retry FAILED logs"| K
style L fill:#3b82f6,color:#fff
style D fill:#22c55e,color:#fff
style M fill:#f59e0b,color:#fff
Two-Tier Retry Strategy
The system uses two independent retry layers to handle both transient blips and prolonged outages:
| Layer | Scope | Strategy | Max Attempts | Base Delay |
|---|---|---|---|---|
| RetryService (in-request) | Transient network errors | Exponential back-off with jitter | 3 | 1 second |
| DlqService (cross-request) | Full provider outages | Cron-based exponential back-off | 5 | 1 minute |
If the in-request RetryService exhausts its 3 attempts, the AlertDeliveryService catches the failure and marks the AlertLog as FAILED. The DLQ cron then picks it up on the next tick (every 5 minutes) and re-dispatches it.
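The in-request layer can be pictured with a minimal sketch like the one below. The names `withInRequestRetry` and `inRequestDelayMs` are hypothetical and do not reflect the actual RetryService API, and the 25% jitter factor is an assumption: the guide only states that jitter is applied, with 3 attempts and a 1-second base delay.

```typescript
// Sketch only: the real RetryService is not shown in this guide.
interface InRequestRetryOptions {
  maxAttempts: number; // 3 per the table above
  baseDelayMs: number; // 1000 per the table above
  random?: () => number; // injectable so tests are deterministic
  sleep?: (ms: number) => Promise<void>; // injectable so tests do not wait
}

// Delay before the next try after `failedAttempt` (1-based) has failed:
// base * 2^(failedAttempt - 1), plus up to 25% random jitter (assumed factor).
function inRequestDelayMs(
  failedAttempt: number,
  baseDelayMs: number,
  random: () => number = Math.random,
): number {
  const exponential = baseDelayMs * 2 ** (failedAttempt - 1);
  return Math.round(exponential + exponential * 0.25 * random());
}

async function withInRequestRetry<T>(
  send: () => Promise<T>,
  opts: InRequestRetryOptions,
): Promise<T> {
  const sleep =
    opts.sleep ?? ((ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms)));
  let lastError: unknown;
  for (let attempt = 1; attempt <= opts.maxAttempts; attempt++) {
    try {
      return await send();
    } catch (err) {
      lastError = err;
      // No sleep after the final attempt: the caller marks the log FAILED instead.
      if (attempt < opts.maxAttempts) {
        await sleep(inRequestDelayMs(attempt, opts.baseDelayMs, opts.random));
      }
    }
  }
  throw lastError;
}
```

Under these assumptions a fully failing send waits roughly 1 second and then 2 seconds before giving up, which keeps the in-request layer short enough to run inside an event handler while the DLQ covers longer outages.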
AlertLog Entity
Every delivery attempt creates a single AlertLog document in the alert_logs collection.
Schema
@modelOptions({
  schemaOptions: { timestamps: true, collection: 'alert_logs' }
})
@index({ status: 1, nextRetryAt: 1 }) // DLQ retry lookup
@index({ channel: 1 })
@index({ userId: 1 })
@index({ createdAt: 1 }, { expireAfterSeconds: 90 * 24 * 60 * 60 }) // 90-day TTL
export class AlertLog {
  id: string; // nanoid
  channel: AlertChannel; // EMAIL | SMS | PUSH | IN_APP
  status: AlertStatus; // PENDING | DELIVERED | FAILED | DEAD
  recipient: string; // email address, phone number, or push token
  subject?: string; // email subject or push title
  eventType: string; // e.g. 'order_status_changed', 'notify_email'
  payload?: object; // frozen copy of the original event payload
  providerRef?: string; // Resend message ID, Twilio SID, etc.
  error?: string; // last error message
  attempts: number; // delivery attempt count
  userId?: string; // target user ID
  nextRetryAt?: Date; // when the DLQ should next retry
  createdAt: Date;
  updatedAt: Date;
}
Enums
enum AlertChannel {
  EMAIL = 'EMAIL',
  SMS = 'SMS',
  PUSH = 'PUSH',
  IN_APP = 'IN_APP' // Live: powers NotificationCenter.svelte via polling
}
enum AlertStatus {
  PENDING = 'PENDING', // Log created, send in progress
  DELIVERED = 'DELIVERED', // Provider confirmed delivery
  FAILED = 'FAILED', // Send failed, queued for retry
  DEAD = 'DEAD' // Max retries exhausted
}
Data Retention
The alert_logs collection uses a MongoDB TTL index set to 90 days. Documents are pruned automatically by MongoDB's background TTL thread, so no application-level cleanup is needed.
Why 90 days? It is a common retention window for operational notification logs: long enough for debugging delivery issues and compliance audits, short enough to respect the GDPR storage limitation principle.
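To make the retention arithmetic concrete, here is a small sketch of how a retention period in days maps to the `expireAfterSeconds` value used by the TTL index. `ttlSeconds` and `earliestExpiry` are hypothetical helpers, not part of the service.

```typescript
// Sketch only: hypothetical helpers illustrating the TTL arithmetic.
const SECONDS_PER_DAY = 24 * 60 * 60;

// Converts a retention period in days to the expireAfterSeconds index option.
function ttlSeconds(days: number): number {
  if (!Number.isInteger(days) || days <= 0) {
    throw new Error(`retention must be a positive whole number of days, got ${days}`);
  }
  return days * SECONDS_PER_DAY;
}

// Earliest moment at which a document created at `createdAt` becomes eligible for removal.
function earliestExpiry(createdAt: Date, days = 90): Date {
  return new Date(createdAt.getTime() + ttlSeconds(days) * 1000);
}
```

A log created at 2024-01-01T00:00:00Z becomes eligible at 2024-03-31T00:00:00Z. MongoDB's TTL monitor runs periodically (about once a minute), so actual removal can lag slightly behind that timestamp.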
Delivery Flow
Happy Path (Email Example)
// 1. Controller receives event
@EventPattern('notify_email')
async notifyEmail(data: NotifyEmailDto) {
  // 2. Check user preferences
  const shouldSend = await this.nodesClient.shouldSendEmail(data.userId);
  if (!shouldSend) return;
  // 3. Route through delivery orchestrator
  await this.deliveryService.sendEmail({
    email: data.email,
    subject: 'Your Murchase Confirmation',
    text: data.text,
    eventType: 'notify_email',
    userId: data.userId,
    payload: { ...data },
  });
}
Inside AlertDeliveryService
sendEmail(opts)
├── 1. dlqService.enqueue(PENDING) → Creates AlertLog
├── 2. gmailService.notifyEmail() → Calls Resend API (with RetryService)
├── 3a. Success → dlqService.markDelivered(logId, providerRef)
└── 3b. Failure → dlqService.markFailed(logId, error)
    ├── Sets nextRetryAt with exponential back-off
    └── If attempts >= 5 → status = DEAD
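The flow above can be sketched as a single function. `DlqPort`, `EmailPort`, and `sendEmailTracked` are hypothetical stand-ins for DlqService, GmailAlertsService, and the orchestrator method; the method names follow the flow, but the signatures and the choice to swallow provider errors are assumptions.

```typescript
// Sketch only: hypothetical interfaces; the real signatures are not shown in this guide.
interface EmailSendOptions {
  email: string;
  subject: string;
  text: string;
  eventType: string;
  userId?: string;
  payload?: object;
}

interface DlqPort {
  enqueue(entry: { channel: "EMAIL"; recipient: string; eventType: string; payload?: object }): Promise<string>;
  markDelivered(logId: string, providerRef?: string): Promise<void>;
  markFailed(logId: string, error: string): Promise<void>;
}

interface EmailPort {
  notifyEmail(opts: { email: string; subject: string; text: string }): Promise<{ id?: string }>;
}

async function sendEmailTracked(dlq: DlqPort, mailer: EmailPort, opts: EmailSendOptions): Promise<boolean> {
  // 1. Persist a PENDING log before touching the provider, so a crash mid-send leaves a record.
  const logId = await dlq.enqueue({
    channel: "EMAIL",
    recipient: opts.email,
    eventType: opts.eventType,
    payload: opts.payload,
  });

  let providerRef: string | undefined;
  try {
    // 2. Call the provider (the in-request retry layer would wrap this call).
    const result = await mailer.notifyEmail({ email: opts.email, subject: opts.subject, text: opts.text });
    providerRef = result.id;
  } catch (err) {
    // 3b. Failure: hand the log to the DLQ instead of rethrowing; the cron retries later.
    await dlq.markFailed(logId, err instanceof Error ? err.message : String(err));
    return false;
  }

  // 3a. Success: recorded outside the try block so a logging error is never mistaken for a send failure.
  await dlq.markDelivered(logId, providerRef);
  return true;
}
```

Keeping `markDelivered` outside the `try` is a deliberate choice in this sketch: if the database write fails after a successful send, the message is not queued for a duplicate retry.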
DLQ Retry Processor
Every 5 minutes, the cron job picks up FAILED logs whose nextRetryAt has passed:
@Cron('0 */5 * * * *')
async processRetryBatch(): Promise<void> {
  // Fetch up to 25 FAILED logs whose nextRetryAt has passed
  const retryable = await this.dlqService.getRetryable(25);
  for (const log of retryable) {
    switch (log.channel) {
      case AlertChannel.EMAIL:
        await this.attemptEmail(log, { /* reconstructed from payload */ });
        break;
      case AlertChannel.SMS:
        await this.attemptSms(log, { /* reconstructed from payload */ });
        break;
      case AlertChannel.PUSH:
        await this.attemptPush(log, { /* reconstructed from payload */ });
        break;
    }
  }
}
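The selection behind `getRetryable(25)` is not shown in this guide. As an in-memory illustration of what it is described as returning, the sketch below filters FAILED logs whose `nextRetryAt` has passed and caps the batch. `selectRetryable` is a hypothetical name, and the oldest-first ordering is an assumption.

```typescript
// Sketch only: an in-memory stand-in for the query behind dlqService.getRetryable().
interface RetryCandidate {
  id: string;
  status: "PENDING" | "DELIVERED" | "FAILED" | "DEAD";
  nextRetryAt?: Date;
}

function selectRetryable(logs: RetryCandidate[], now: Date, limit = 25): RetryCandidate[] {
  return logs
    // Only FAILED logs that are due; DEAD and DELIVERED logs are never retried.
    .filter(
      (log) =>
        log.status === "FAILED" &&
        log.nextRetryAt !== undefined &&
        log.nextRetryAt.getTime() <= now.getTime(),
    )
    // Oldest due first (assumed ordering), so long-waiting logs are not starved.
    .sort((a, b) => (a.nextRetryAt?.getTime() ?? 0) - (b.nextRetryAt?.getTime() ?? 0))
    .slice(0, limit);
}
```

Against MongoDB, the equivalent filter would presumably be `{ status: 'FAILED', nextRetryAt: { $lte: now } }`, which is the access pattern the compound `{ status: 1, nextRetryAt: 1 }` index in the schema is labelled for.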
Back-Off Schedule
| Attempt | Delay | Status if fails |
|---|---|---|
| 1 | 1 minute | FAILED |
| 2 | 2 minutes | FAILED |
| 3 | 4 minutes | FAILED |
| 4 | 8 minutes | FAILED |
| 5 | n/a | DEAD (no more retries) |
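The schedule above follows delay = 1 minute × 2^(attempt − 1). A pure-function sketch of the state change that `markFailed` is described as making could look like this. `applyFailure` is a hypothetical name, and it assumes `attempts` is incremented on each failure.

```typescript
// Sketch only: a pure version of the transition described for markFailed().
interface FailureOutcome {
  attempts: number;
  status: "FAILED" | "DEAD";
  nextRetryAt?: Date;
}

const DLQ_BASE_DELAY_MS = 60 * 1000; // 1 minute, per the table above

function applyFailure(previousAttempts: number, now: Date, maxAttempts = 5): FailureOutcome {
  const attempts = previousAttempts + 1;
  if (attempts >= maxAttempts) {
    // Final attempt failed: no further retry is scheduled.
    return { attempts, status: "DEAD" };
  }
  // 1, 2, 4, 8 minutes for attempts 1 through 4.
  const delayMs = DLQ_BASE_DELAY_MS * 2 ** (attempts - 1);
  return { attempts, status: "FAILED", nextRetryAt: new Date(now.getTime() + delayMs) };
}
```

Because the cron only runs every 5 minutes, delays of 1, 2, and 4 minutes all come due by the next tick. Under this reading, the observed gap between retries is bounded below by the cron interval rather than by the schedule itself.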
GraphQL Queries (Admin)
The AlertLogResolver exposes read-only queries for admin/debugging. All queries are protected by GqlAuthGuard.
alertLogs
Paginated list with optional channel and status filters:
query {
  alertLogs(filter: { channel: EMAIL, status: FAILED, first: 20, after: "cursor-id" }) {
    id
    channel
    status
    recipient
    subject
    eventType
    attempts
    error
    providerRef
    createdAt
    nextRetryAt
  }
}
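The `first` and `after` arguments suggest cursor-based pagination. The resolver's actual logic is not shown here, so the following is only an in-memory sketch of the idea: `pageAfter` is a hypothetical name, the page-size cap of 100 is invented for illustration, and newest-first ordering is an assumption.

```typescript
// Sketch only: cursor pagination over an in-memory list.
interface PageItem {
  id: string;
  createdAt: Date;
}

const MAX_PAGE_SIZE = 100; // hypothetical cap; the real limit is not stated in this guide

function pageAfter<T extends PageItem>(items: T[], first = 20, after?: string): T[] {
  // Clamp the requested size so a client cannot ask for an unbounded page.
  const size = Math.min(Math.max(first, 1), MAX_PAGE_SIZE);
  // Newest first (assumed); copy before sorting so the input is not mutated.
  const sorted = [...items].sort((a, b) => b.createdAt.getTime() - a.createdAt.getTime());
  if (after === undefined) return sorted.slice(0, size);
  const cursorIndex = sorted.findIndex((item) => item.id === after);
  // Unknown cursor: return an empty page rather than silently restarting from the top.
  if (cursorIndex === -1) return [];
  return sorted.slice(cursorIndex + 1, cursorIndex + 1 + size);
}
```

A client would pass the `id` of the last item it received as `after` to fetch the next page.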
alertLog
Single log lookup by ID:
query {
  alertLog(id: "abc123") {
    id
    channel
    status
    recipient
    payload
    attempts
    error
  }
}
dlqSummary
Aggregate counts by status, useful for dashboards:
query {
  dlqSummary {
    total
    statusCounts {
      status
      count
    }
  }
}
Example response:
{
  "data": {
    "dlqSummary": {
      "total": 1247,
      "statusCounts": [
        { "status": "DELIVERED", "count": 1200 },
        { "status": "FAILED", "count": 40 },
        { "status": "DEAD", "count": 5 },
        { "status": "PENDING", "count": 2 }
      ]
    }
  }
}
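The response shape above (a total plus one count per status) can be reproduced with a simple group-and-count. The sketch below does this in memory; `summarizeByStatus` is a hypothetical name, and the real service presumably aggregates in MongoDB rather than in application code.

```typescript
// Sketch only: in-memory equivalent of the dlqSummary aggregation.
type SummaryStatus = "PENDING" | "DELIVERED" | "FAILED" | "DEAD";

interface DlqSummarySketch {
  total: number;
  statusCounts: { status: SummaryStatus; count: number }[];
}

function summarizeByStatus(statuses: SummaryStatus[]): DlqSummarySketch {
  const counts = new Map<SummaryStatus, number>();
  for (const status of statuses) {
    counts.set(status, (counts.get(status) ?? 0) + 1);
  }
  const statusCounts = Array.from(counts, ([status, count]) => ({ status, count }))
    // Largest bucket first, matching the order of the example response.
    .sort((a, b) => b.count - a.count);
  return { total: statuses.length, statusCounts };
}
```

In the example response the per-status counts (1200 + 40 + 5 + 2) add up to the reported total of 1247, which is the invariant this sketch preserves.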
Channel Integration
Email (Resend)
The GmailAlertsService sends transactional emails via the Resend API. It supports vertical-specific templates (Restaurant, Tour, Photography) that adapt the layout and content to the Malet's business type.
SMS (Twilio)
The SmsAlertsService sends SMS via Twilio. Used for time-sensitive notifications like Murchase confirmations and Workroom action reminders.
Push (FCM)
The PushAlertsService sends push notifications to mobile/web clients. Push tokens are fetched from the nodes subgraph via the NodesClientService.
In-App (Live)
The IN_APP channel creates AlertLog entries with channel: IN_APP and publishes them through the in-process PubSub for instant WebSocket delivery via the federated gateway (port 30000). The frontend notificationStore.svelte.ts receives these via graphql-ws subscription in real time, with automatic fallback to 30-second polling if the WS connection is unavailable. See Notification Connection Modes for the full transport architecture. Used by Organization invitations (org_invite_created), Murchase confirmations (order_created), Community assignment events, and uChat messages (new_message). See Invite & Notification Pipeline for the full cross-subgraph flow.
Environment Variables
| Variable | Default | Description |
|---|---|---|
| DLQ_MAX_ATTEMPTS | 5 | Max delivery attempts before moving to DEAD |
| ALERT_LOG_TTL_DAYS | 90 | Days before MongoDB auto-prunes alert logs |
| ALERTS_SERVICE_PORT_TCP | (none) | Required. TCP port for microservice communication |
| NODES_SERVICE_HOST | localhost | Nodes service host for preference lookups |
| NODES_SERVICE_PORT_TCP | 3011 | Nodes service TCP port |
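A minimal sketch of how these variables and their defaults could be read and validated at startup follows. `loadAlertsConfig` and `readPositiveInt` are hypothetical helpers; how the service actually loads its configuration is not shown in this guide.

```typescript
// Sketch only: hypothetical config loader mirroring the table above.
interface AlertsConfig {
  dlqMaxAttempts: number;
  alertLogTtlDays: number;
  alertsServicePortTcp: number;
  nodesServiceHost: string;
  nodesServicePortTcp: number;
}

type EnvLike = Record<string, string | undefined>;

function readPositiveInt(env: EnvLike, key: string, fallback?: number): number {
  const raw = env[key];
  if (raw === undefined || raw.trim() === "") {
    // No fallback means the variable is required.
    if (fallback === undefined) throw new Error(`${key} is required`);
    return fallback;
  }
  const value = Number(raw);
  if (!Number.isInteger(value) || value <= 0) {
    throw new Error(`${key} must be a positive integer, got "${raw}"`);
  }
  return value;
}

function loadAlertsConfig(env: EnvLike): AlertsConfig {
  return {
    dlqMaxAttempts: readPositiveInt(env, "DLQ_MAX_ATTEMPTS", 5),
    alertLogTtlDays: readPositiveInt(env, "ALERT_LOG_TTL_DAYS", 90),
    alertsServicePortTcp: readPositiveInt(env, "ALERTS_SERVICE_PORT_TCP"), // required, no default
    nodesServiceHost: env["NODES_SERVICE_HOST"] || "localhost",
    nodesServicePortTcp: readPositiveInt(env, "NODES_SERVICE_PORT_TCP", 3011),
  };
}
```

Failing fast on a missing `ALERTS_SERVICE_PORT_TCP` surfaces a misconfiguration at boot rather than at the first incoming event. In a Node process this would typically be called as `loadAlertsConfig(process.env)`.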
Module Structure
apps/alerts/src/alert-log/
├── alert-log.entity.ts # AlertLog + enums
├── alert-log.module.ts # NestJS module wiring
├── alert-log.resolver.ts # GraphQL admin queries (federation)
├── alert-log.resolver.spec.ts # Resolver unit tests
├── alert-delivery.service.ts # Orchestrator + PubSub publish
├── alert-delivery.service.spec.ts
├── dlq.service.ts # Dead Letter Queue
├── dlq.service.spec.ts
├── notification.pubsub.ts # Shared PubSub instance
├── notification.pubsub.spec.ts # PubSub + filter unit tests
├── notification-subscription.resolver.ts # WS subscription (Yoga federated)
├── subscription.module.ts # Subscription module
├── dto/
│   ├── alert-log-filter.input.ts # Query filter input
│   └── dlq-summary.type.ts # Summary response type
└── index.ts # Barrel exports
Testing
Unit Tests
# Run all alerts unit tests (125 tests across 18 suites)
npm run test -- apps/alerts --no-coverage
Key test coverage:
- DlqService: enqueue, markDelivered, markFailed (including DEAD transition), getRetryable, getSummary, cursor pagination
- AlertDeliveryService: sendEmail/sendSms/sendPush success + failure paths, DLQ retry batch for all channels
- AlertLogResolver: alertLogs pagination + filter cap, alertLog by ID, dlqSummary aggregation
E2E Tests
# Run alerts E2E tests (17 tests across 5 suites)
npx jest --config apps/alerts/test/jest-e2e.json --detectOpenHandles
Covers: email/SMS/push delivery tracking, GraphQL query responses, and auth guard enforcement.
Cross-Service Integration
How Other Subgraphs Trigger Notifications
Subgraphs emit events via TCP microservice transport. The alerts service listens with @EventPattern:
// From the murchases service: order status change
this.alertsClient.emit('order_status_changed', {
  orderId: order.id,
  buyerId: order.buyerId,
  newStatus: OrderStatus.SHIPPED,
  previousStatus: OrderStatus.PROCESSING,
  verticalSlug: order.verticalSlug
});
The alerts controller receives the event, checks user preferences via NodesClientService, and routes through AlertDeliveryService, creating an AlertLog for every notification attempt.
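Because event names and payloads are plain strings and objects on the wire, a shared type map can help catch mismatches between emitters and the alerts controllers at compile time. The sketch below is one possible approach, not something the codebase is known to contain: `AlertEventMap`, `AlertsEmitter`, and `emitAlert` are hypothetical names, and the payload shapes are inferred from the examples in this guide.

```typescript
// Sketch only: a typed wrapper around an emit(pattern, data) style client.
interface AlertEventMap {
  order_status_changed: {
    orderId: string;
    buyerId: string;
    newStatus: string;
    previousStatus: string;
    verticalSlug: string;
  };
  notify_email: { email: string; text: string; userId: string };
}

// Minimal shape of the client used above; only the emit(pattern, data) call is assumed.
interface AlertsEmitter {
  emit(pattern: string, data: unknown): unknown;
}

function emitAlert<K extends keyof AlertEventMap>(
  client: AlertsEmitter,
  event: K,
  payload: AlertEventMap[K],
): void {
  // The compiler rejects unknown event names and payloads with missing or extra fields.
  client.emit(event, payload);
}
```

With this in place, a call such as `emitAlert(client, 'order_status_changed', { orderId: 'o1' })` would fail type checking because required fields are missing.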
Dependency Map
| Alerts depends on | For |
|---|---|
| nodes (TCP) | User preferences, email, push tokens, quiet hours |
| Resend API | Email delivery |
| Twilio API | SMS delivery |
| MongoDB | AlertLog persistence |
| Other services depend on Alerts | Via |
|---|---|
| murchases | order_status_changed, order_created events |
| services | booking_confirmed events |
| malets | notify_email (contact form) |
| organizations | org_invite_created, org_invite_sms events |
Related
- Mobile Push Infrastructure: In-house APNs/FCM push dispatch, native app shells, and push token lifecycle
- Alerts & Notifications: Frontend toast system, session warnings, and activity monitoring
- Alert Templates: Frontend template maker for authoring and previewing email/SMS templates
- Invite & Notification Pipeline: Cross-subgraph flow for Organization invitations with email→userId resolution
- Community Orchestration: Issue/Discussion assignment workflows that dispatch TCP events to the alerts service
- Gateway Tracing & Observability: Companion observability layer for the API gateway
- Revenue Sharing & Payouts: Payout notifications dispatched through the alerts delivery pipeline
- Notification Connection Modes: WebSocket vs polling transport, connection state machine, and dual-server topology
- Workspaces & The Tower: Alerts Analytics sub-tab in The Tower consuming alertLogs and dlqSummary
- uMail Domain Verification: DNS setup (SPF/DKIM/DMARC) for mallnline.com Resend domain verification
- Error Tracking (GlitchTip): Client-side crash reporting, the frontend counterpart to server-side alert delivery tracking