Developer Docs

Monitoring & Alerting Infrastructure

The Ngwenya platform uses a multi-layered observability stack to ensure high availability and rapid incident response. The infrastructure is composed of three primary pillars:

  1. Error Tracking: GlitchTip (Sentry-compatible) handles application crashes and exceptions.
  2. Health Monitoring: A native Rust background worker inside the Tower subgraph polls all services for uptime.
  3. Time-Series Metrics: Prometheus scrapes the GraphQL Gateway, and Grafana visualizes the data.

1. Tower Health Monitor

Instead of running a separate Node.js service (like Uptime Kuma), health monitoring is built natively into the Rust-based Tower subgraph.

Tower maintains a PlatformHealthOverview state via a background task started with tokio::spawn. The task polls the entire stack every 30 seconds by default; the interval is configurable via HEALTH_CHECK_INTERVAL.
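As a rough illustration of the configurable interval, the sketch below turns the HEALTH_CHECK_INTERVAL value into a Duration. It assumes the variable holds a whole number of seconds and that missing, invalid, or zero values fall back to the 30-second default; the function names are hypothetical and not taken from the Tower codebase.

```rust
use std::time::Duration;

/// Default poll interval, taken from the 30 seconds stated above.
const DEFAULT_INTERVAL_SECS: u64 = 30;

/// Turn the raw HEALTH_CHECK_INTERVAL value into a Duration.
/// Assumption: the value is a whole number of seconds. Missing,
/// non-numeric, or zero values fall back to the default.
fn parse_interval(raw: Option<&str>) -> Duration {
    let secs = raw
        .and_then(|s| s.trim().parse::<u64>().ok())
        .filter(|&s| s > 0)
        .unwrap_or(DEFAULT_INTERVAL_SECS);
    Duration::from_secs(secs)
}

/// Read the interval from the process environment.
fn interval_from_env() -> Duration {
    parse_interval(std::env::var("HEALTH_CHECK_INTERVAL").ok().as_deref())
}
```

A polling loop could call interval_from_env() once at startup and sleep for that long between rounds.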

What is Monitored?

  • Subgraphs (NestJS & Rust): Polled via HTTP POST to /graphql with the minimal query {__typename}.
  • Infrastructure (DBs, Redis): Polled via TCP connect.
  • Portals & Frontends: Polled via HTTP GET.
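The TCP connect check used for infrastructure can be sketched with the standard library alone. This is a minimal blocking version under stated assumptions: the Probe type and tcp_probe function are hypothetical, and the real worker runs inside a tokio task and may use an async connect instead.

```rust
use std::net::{SocketAddr, TcpStream};
use std::time::{Duration, Instant};

/// Outcome of one probe. The fields loosely mirror `status` and
/// `responseTimeMs`; the type itself is hypothetical.
#[derive(Debug, PartialEq)]
enum Probe {
    Up { response_time_ms: u128 },
    Down,
}

/// Attempt a TCP connection and time it.
/// Any error (connection refused, timeout, unreachable) counts as Down.
/// Note: `timeout` must be non-zero, otherwise connect_timeout returns an error.
fn tcp_probe(addr: SocketAddr, timeout: Duration) -> Probe {
    let started = Instant::now();
    match TcpStream::connect_timeout(&addr, timeout) {
        Ok(_stream) => Probe::Up {
            response_time_ms: started.elapsed().as_millis(),
        },
        Err(_) => Probe::Down,
    }
}
```

A successful connect only shows that the port accepts connections; it does not prove that the database or Redis instance behind it can serve queries.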

GraphQL API

The Tower subgraph exposes health data through the federated graph:

query GetHealth {
  platformHealth {
    uptimePercentage
    healthyCount
    unhealthyCount
    services {
      name
      category
      status
      responseTimeMs
      consecutiveFailures
    }
  }
}
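To make the aggregate fields concrete, the sketch below derives healthyCount, unhealthyCount, and uptimePercentage from a list of per-service results. It rests on a loud assumption: that uptimePercentage is the share of services currently healthy. The source does not define the field, and Tower may instead compute it over a time window. All type and function names are hypothetical.

```rust
/// Hypothetical in-memory mirror of one entry in `services`.
struct ServiceHealth {
    name: String,
    healthy: bool,
}

/// Hypothetical mirror of the aggregate fields in `platformHealth`.
struct Overview {
    healthy_count: usize,
    unhealthy_count: usize,
    uptime_percentage: f64,
}

/// Aggregate per-service results into the overview.
/// Assumption: uptime is the percentage of services healthy right now,
/// and an empty service list is reported as 100%.
fn summarize(services: &[ServiceHealth]) -> Overview {
    let healthy_count = services.iter().filter(|s| s.healthy).count();
    let unhealthy_count = services.len() - healthy_count;
    let uptime_percentage = if services.is_empty() {
        100.0
    } else {
        healthy_count as f64 * 100.0 / services.len() as f64
    };
    Overview {
        healthy_count,
        unhealthy_count,
        uptime_percentage,
    }
}

/// Names of the services that are currently failing, e.g. for a dashboard list.
fn unhealthy_names(services: &[ServiceHealth]) -> Vec<&str> {
    services
        .iter()
        .filter(|s| !s.healthy)
        .map(|s| s.name.as_str())
        .collect()
}
```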

This allows the Tower admin dashboard to display a unified view of both errors (from GlitchTip) and platform health.

2. Prometheus & Grafana

The Ngwenya Gateway natively exposes Prometheus metrics at GET /admin/metrics/prometheus. We run a lightweight Prometheus container to scrape this endpoint every 15 seconds.

Grafana is auto-provisioned with the Prometheus data source and a pre-built dashboard covering:

  • Total request rate
  • Error percentage
  • Cache and APQ hit ratios
  • Top 10 slowest operations
  • Gateway uptime

Accessing the Dashboard

  1. Run make monitoring-up
  2. Visit http://localhost:3100
  3. The "Ngwenya Gateway" dashboard loads automatically (anonymous viewing is enabled for local development). To make changes, log in as admin / admin.

3. Alerts Integration

When a service transitions from Up to Down (or vice versa), the Tower Health Monitor dispatches a service_health_changed event.

This event is routed to the existing Alerts Subgraph, which handles delivery to platform administrators via the standard notification channels (Email, Push, In-App).
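The transition rule described in this section can be sketched as a small state tracker that compares each poll result with the previous one and yields an event only when the status flips. The Tracker and HealthChanged types are hypothetical stand-ins for the service_health_changed payload, and the sketch assumes the first observation of a service does not count as a transition.

```rust
use std::collections::HashMap;

#[derive(Clone, Copy, Debug, PartialEq)]
enum Status {
    Up,
    Down,
}

/// Hypothetical stand-in for the service_health_changed event payload.
#[derive(Debug, PartialEq)]
struct HealthChanged {
    service: String,
    from: Status,
    to: Status,
}

/// Remembers the last known status of every service.
struct Tracker {
    last: HashMap<String, Status>,
}

impl Tracker {
    fn new() -> Self {
        Tracker { last: HashMap::new() }
    }

    /// Record the latest poll result and return an event only on a flip.
    /// Assumption: the first observation of a service emits nothing.
    fn observe(&mut self, service: &str, now: Status) -> Option<HealthChanged> {
        match self.last.insert(service.to_string(), now) {
            Some(prev) if prev != now => Some(HealthChanged {
                service: service.to_string(),
                from: prev,
                to: now,
            }),
            _ => None, // first observation, or status unchanged
        }
    }
}
```

Emitting only on a flip keeps a service that stays Down across many polls from generating one notification per poll.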