Monitoring & Alerting Infrastructure

The Ngwenya platform uses a multi-layered observability stack to ensure high availability and rapid incident response. The infrastructure is composed of three primary pillars:

Error Tracking: GlitchTip (Sentry-compatible) handles application crashes and exceptions.
Health Monitoring: A native Rust background worker inside the Tower subgraph polls all services for uptime.
Time-Series Metrics: Prometheus scrapes the GraphQL Gateway, and Grafana visualizes the data.

1. Tower Health Monitor

Rather than relying on a separate Node.js service such as Uptime Kuma, the platform builds health monitoring natively into the Rust-based Tower subgraph.

Tower maintains a PlatformHealthOverview state that is refreshed by a background tokio::spawn task. The task polls the entire stack every 30 seconds by default; the interval is configurable via HEALTH_CHECK_INTERVAL.
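As a rough illustration of the configurable interval, the sketch below resolves HEALTH_CHECK_INTERVAL into a polling interval. It assumes the variable holds a whole number of seconds and that invalid values fall back to the 30-second default; neither detail is confirmed by this page, and the function names are hypothetical rather than Tower's actual code.

```rust
use std::time::Duration;

/// Default polling interval described in the text above.
const DEFAULT_INTERVAL_SECS: u64 = 30;

/// Turns the raw value of HEALTH_CHECK_INTERVAL into a polling interval.
/// Assumption: the variable holds a whole number of seconds.
/// Missing, unparsable, or zero values fall back to the 30-second default.
fn poll_interval(raw: Option<&str>) -> Duration {
    let secs = raw
        .and_then(|v| v.trim().parse::<u64>().ok())
        .filter(|&s| s > 0)
        .unwrap_or(DEFAULT_INTERVAL_SECS);
    Duration::from_secs(secs)
}

/// Reads the interval from the process environment (hypothetical helper).
fn poll_interval_from_env() -> Duration {
    poll_interval(std::env::var("HEALTH_CHECK_INTERVAL").ok().as_deref())
}
```

Keeping the parsing in a pure function that takes the raw string makes the fallback behaviour easy to check without touching the process environment.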

What is Monitored?

Subgraphs (NestJS & Rust): Polled via HTTP POST to /graphql with the minimal query {__typename}.
Infrastructure (DBs, Redis): Polled via TCP connect.
Portals & Frontends: Polled via HTTP GET.
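The TCP connect check used for infrastructure can be sketched with the standard library alone. The ProbeResult type, the tcp_probe function, and the explicit timeout parameter are assumptions made for illustration; the real monitor runs inside a tokio task and most likely uses async sockets instead.

```rust
use std::net::{SocketAddr, TcpStream};
use std::time::{Duration, Instant};

/// Outcome of a single probe (hypothetical type for this sketch).
#[derive(Debug, PartialEq)]
enum ProbeResult {
    Up { response_time_ms: u128 },
    Down,
}

/// Attempts a TCP connection and reports how long the handshake took.
/// Any connection error, including a timeout, is treated as Down.
fn tcp_probe(addr: SocketAddr, timeout: Duration) -> ProbeResult {
    let started = Instant::now();
    match TcpStream::connect_timeout(&addr, timeout) {
        Ok(_stream) => ProbeResult::Up {
            response_time_ms: started.elapsed().as_millis(),
        },
        Err(_) => ProbeResult::Down,
    }
}
```

A successful handshake only shows that the port accepts connections. It does not confirm that the database or Redis instance behind it can serve queries, which is the usual trade-off of a connect-only check.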

GraphQL API

The Tower subgraph exposes health data through the federated graph:

query GetHealth {
  platformHealth {
    uptimePercentage
    healthyCount
    unhealthyCount
    services {
      name
      category
      status
      responseTimeMs
      consecutiveFailures
    }
  }
}

This allows the Tower admin dashboard to display a unified view of both errors (from GlitchTip) and platform health.

2. Prometheus & Grafana

The Ngwenya Gateway natively exposes Prometheus metrics at GET /admin/metrics/prometheus. We run a lightweight Prometheus container to scrape this endpoint every 15 seconds.

Grafana is auto-provisioned with the Prometheus data source and a pre-built dashboard covering:

Total request rate
Error percentage
Cache and APQ hit ratios
Top 10 slowest operations
Gateway uptime

Accessing the Dashboard

Run make monitoring-up
Visit http://localhost:3100
The "Ngwenya Gateway" dashboard will load automatically (anonymous viewing enabled for local dev). Admin login is admin / admin.

3. Alerts Integration

When a service transitions from Up to Down (or vice versa), the Tower Health Monitor dispatches a service_health_changed event.
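The transition rule can be illustrated with a small edge-detection sketch: an event is produced only when the status changes, not on every poll. The Status enum, the ServiceHealthChanged struct, and its fields are hypothetical; the actual payload of service_health_changed is not described on this page.

```rust
/// Two-state service status (hypothetical; the real monitor may track more states).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Status {
    Up,
    Down,
}

/// Illustrative stand-in for the service_health_changed event payload.
#[derive(Debug, PartialEq, Eq)]
struct ServiceHealthChanged {
    service: String,
    from: Status,
    to: Status,
}

/// Returns an event only when the status differs from the previous poll.
fn detect_transition(
    service: &str,
    previous: Status,
    current: Status,
) -> Option<ServiceHealthChanged> {
    if previous == current {
        return None; // steady state: nothing to dispatch
    }
    Some(ServiceHealthChanged {
        service: service.to_string(),
        from: previous,
        to: current,
    })
}
```

Emitting on edges rather than on every failed poll keeps a service that stays down from generating a notification every polling cycle.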

This event is routed to the existing Alerts Subgraph, which handles delivery to platform administrators via the standard notification channels (Email, Push, In-App).

Related Documentation

Tower Health Observability: Dual-layer probing architecture, diagnostic matrix, smoke query mapping, and frontend dashboard.
Tower Subgraph: Rust/Axum subgraph running the health monitor and GlitchTip proxy.
Error Tracking (GlitchTip): Self-hosted crash reporting, covering Docker setup, frontend SDK, and DSN configuration.
Gateway Tracing & Observability: Prometheus metrics, response caching, and APQ that feed Grafana dashboards.
Platform Health Monitoring: User-facing guide for the health dashboard.