Monitoring & Alerting Infrastructure
The Ngwenya platform uses a multi-layered observability stack to ensure high availability and rapid incident response. The infrastructure is composed of three primary pillars:
- Error Tracking: GlitchTip (Sentry-compatible) handles application crashes and exceptions.
- Health Monitoring: A native Rust background worker inside the Tower subgraph polls all services for uptime.
- Time-Series Metrics: Prometheus scrapes the GraphQL Gateway, and Grafana visualizes the data.
1. Tower Health Monitor
Instead of running a separate Node.js service (like Uptime Kuma), health monitoring is built natively into the Rust-based Tower subgraph.
Tower maintains a PlatformHealthOverview state via a background tokio::spawn task that polls the entire stack every 30 seconds (configurable via HEALTH_CHECK_INTERVAL).
What is Monitored?
- Subgraphs (NestJS & Rust): Polled via HTTP POST to /graphql with {__typename}.
- Infrastructure (DBs, Redis): Polled via TCP connect.
- Portals & Frontends: Polled via HTTP GET.
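The interval handling and the TCP connect check described above can be sketched with the Rust standard library alone. This is a minimal illustration, not Tower's actual code: the function names parse_interval and tcp_probe are hypothetical, the sketch assumes HEALTH_CHECK_INTERVAL is expressed in whole seconds, and it uses a blocking connect where the real worker presumably uses async I/O inside its tokio task.

```rust
use std::net::{SocketAddr, TcpStream};
use std::time::{Duration, Instant};

/// Default polling interval in seconds (the documented 30-second default).
const DEFAULT_INTERVAL_SECS: u64 = 30;

/// Interpret the raw HEALTH_CHECK_INTERVAL value as whole seconds.
/// Assumption: the unit is seconds. Missing, non-numeric, or zero
/// values fall back to the default.
fn parse_interval(raw: Option<&str>) -> Duration {
    let secs = raw
        .and_then(|s| s.trim().parse::<u64>().ok())
        .filter(|&s| s > 0)
        .unwrap_or(DEFAULT_INTERVAL_SECS);
    Duration::from_secs(secs)
}

/// Attempt a TCP connect within `timeout`.
/// Returns the connect latency in milliseconds, or None if the
/// connect failed or timed out.
fn tcp_probe(addr: SocketAddr, timeout: Duration) -> Option<u128> {
    let started = Instant::now();
    TcpStream::connect_timeout(&addr, timeout)
        .ok()
        .map(|_stream| started.elapsed().as_millis())
}
```

A caller could read the variable with std::env::var("HEALTH_CHECK_INTERVAL").ok(), pass its as_deref() to parse_interval, and treat a None from tcp_probe as one failed check for that service.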
GraphQL API
Health data is exposed through the federated graph on the Tower subgraph:
query GetHealth {
  platformHealth {
    uptimePercentage
    healthyCount
    unhealthyCount
    services {
      name
      category
      status
      responseTimeMs
      consecutiveFailures
    }
  }
}
This allows the Tower admin dashboard to display a unified view of both errors (from GlitchTip) and platform health.
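To make the summary fields in the query concrete, here is a hedged sketch of how per-service results could be rolled up into healthyCount, unhealthyCount, and uptimePercentage. The types and the summarize function are hypothetical, and the way Tower actually derives uptimePercentage is not stated in this document; the sketch assumes it is simply the share of services that are currently healthy.

```rust
/// Simplified per-service status (hypothetical; the real schema may have more states).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Status {
    Up,
    Down,
}

/// Minimal stand-in for one entry of the `services` list.
#[allow(dead_code)]
struct ServiceHealth {
    name: String,
    status: Status,
}

/// Aggregate view mirroring the summary fields of the query.
#[derive(Debug, PartialEq)]
struct Overview {
    healthy_count: usize,
    unhealthy_count: usize,
    uptime_percentage: f64,
}

/// Roll per-service statuses up into the summary counts.
/// Assumption: uptime_percentage = healthy / total * 100,
/// and an empty service list is reported as 100%.
fn summarize(services: &[ServiceHealth]) -> Overview {
    let healthy = services.iter().filter(|s| s.status == Status::Up).count();
    let unhealthy = services.len() - healthy;
    let uptime = if services.is_empty() {
        100.0
    } else {
        healthy as f64 * 100.0 / services.len() as f64
    };
    Overview {
        healthy_count: healthy,
        unhealthy_count: unhealthy,
        uptime_percentage: uptime,
    }
}
```

Under this assumption, three healthy services and one unhealthy service would yield an uptime of 75%. A time-windowed definition of uptime would need stored history and is outside this sketch.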
2. Prometheus & Grafana
The Ngwenya Gateway natively exposes Prometheus metrics at GET /admin/metrics/prometheus. We run a lightweight Prometheus container to scrape this endpoint every 15 seconds.
Grafana is auto-provisioned with the Prometheus data source and a pre-built dashboard covering:
- Total request rate
- Error percentage
- Cache and APQ hit ratios
- Top 10 slowest operations
- Gateway uptime
Accessing the Dashboard
- Run make monitoring-up
- Visit http://localhost:3100
- The "Ngwenya Gateway" dashboard will load automatically (anonymous viewing enabled for local dev). Admin login is admin/admin.
3. Alerts Integration
When a service transitions from Up to Down (or vice versa), the Tower Health Monitor dispatches a service_health_changed event.
This event is routed to the existing Alerts Subgraph, which handles delivery to platform administrators via the standard notification channels (Email, Push, In-App).
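The transition rule above (emit an event only when a service's status changes, not on every poll) can be sketched as a small state tracker. This is an illustrative sketch under stated assumptions: TransitionTracker, HealthChanged, and observe are hypothetical names, the first observation of a service is assumed not to count as a transition, and actual delivery to the Alerts Subgraph is left out.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Status {
    Up,
    Down,
}

/// Hypothetical payload for a service_health_changed event.
#[derive(Debug, PartialEq, Eq)]
struct HealthChanged {
    service: String,
    from: Status,
    to: Status,
}

/// Remembers the last known status of each service.
#[derive(Default)]
struct TransitionTracker {
    last: HashMap<String, Status>,
}

impl TransitionTracker {
    /// Record the latest poll result for `service`.
    /// Returns an event only when the status differs from the previous
    /// observation; the first observation never produces an event.
    fn observe(&mut self, service: &str, current: Status) -> Option<HealthChanged> {
        let previous = self.last.insert(service.to_string(), current);
        match previous {
            Some(prev) if prev != current => Some(HealthChanged {
                service: service.to_string(),
                from: prev,
                to: current,
            }),
            _ => None,
        }
    }
}
```

Calling observe once per service per polling cycle keeps repeated Down results from producing repeated alerts, while a recovery back to Up still yields its own event.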
Related
- Tower Health Observability: Dual-layer probing architecture, diagnostic matrix, smoke query mapping, and frontend dashboard
- Tower Subgraph: Rust/Axum subgraph running the health monitor and GlitchTip proxy
- Error Tracking (GlitchTip): Self-hosted crash reporting: Docker setup, frontend SDK, and DSN configuration
- Gateway Tracing & Observability: Prometheus metrics, response caching, and APQ that feed Grafana dashboards
- Platform Health Monitoring: User-facing guide for the health dashboard