Tower Health Observability
The Tower Health Observability system is a background Rust task inside the Tower subgraph that continuously probes all 34 platform services using a dual-layer architecture: it checks both process liveness (direct probes) and Gateway schema composition (federated probes). The results are exposed as federated GraphQL and rendered in the Tower admin dashboard as a real-time health grid.
Design Principle: The dual-layer approach was inspired by the existing `make smoke-test` infrastructure (`scripts/test/smoke-test.sh`). Where the CLI tests each subgraph through the Gateway and accepts `Authentication required` as a "reachable" signal, Tower does the same, plus a direct probe to the process port for liveness.
Architecture
┌───────────────────────────────────────────────────────────────────┐
│                       Tower Health Monitor                        │
│                  (tokio::spawn background task)                   │
│                                                                   │
│  For each registered service:                                     │
│                                                                   │
│  ┌───── Direct Probe ─────┐   ┌───── Federated Probe ──────┐      │
│  │ POST {port}/graphql    │   │ POST gateway:30000/graphql │      │
│  │ Body: smoke query      │   │ Body: smoke query          │      │
│  │ Timeout: 8s            │   │ (routed through Gateway)   │      │
│  │ → 200 = UP             │   │ Timeout: 8s                │      │
│  │ → "Auth required" = UP │   │ → 200 = UP                 │      │
│  │ → timeout/error = DOWN │   │ → "Auth required" = UP     │      │
│  └────────────────────────┘   │ → timeout/error = DOWN     │      │
│                               └────────────────────────────┘      │
│                                                                   │
│ Result → SharedHealthState (Arc<RwLock<PlatformHealthOverview>>)  │
└───────────────────────────────────────────────────────────────────┘
           │
           ▼  Exposed via federated GraphQL
┌─────────────────────┐     ┌───────────────────────────┐
│ platformHealth      │     │ PlatformHealthWidget      │
│ (GraphQL Query)     │────▶│ (Svelte 5 component)      │
│ via Hive Gateway    │     │ Grouped grid + modal      │
└─────────────────────┘     └───────────────────────────┘
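The hand-off between the probe loop and the GraphQL resolver can be sketched as follows. This is a minimal illustration, not the Tower implementation: it uses `std::sync::RwLock` rather than whatever lock the tokio task actually uses, the struct is a trimmed-down stand-in for `PlatformHealthOverview`, and `publish_sweep` and `read_snapshot` are hypothetical helper names.

```rust
use std::sync::{Arc, RwLock};

/// Hypothetical, trimmed-down snapshot; the real PlatformHealthOverview
/// carries more fields (see the GetPlatformHealth query).
#[derive(Debug, Clone, Default, PartialEq)]
pub struct PlatformHealthOverview {
    pub total_services: u32,
    pub healthy_count: u32,
    pub unhealthy_count: u32,
}

/// Same shape as the SharedHealthState named in the diagram
/// (assumption: a std RwLock stands in for the real lock type).
pub type SharedHealthState = Arc<RwLock<PlatformHealthOverview>>;

/// Writer side: the probe loop would call this after each sweep.
/// `results` holds one bool per service (true = UP).
pub fn publish_sweep(state: &SharedHealthState, results: &[bool]) {
    let total = results.len() as u32;
    let healthy = results.iter().filter(|up| **up).count() as u32;
    let snapshot = PlatformHealthOverview {
        total_services: total,
        healthy_count: healthy,
        unhealthy_count: total - healthy,
    };
    // Build the snapshot first, then hold the write lock only for the swap.
    *state.write().expect("health state lock poisoned") = snapshot;
}

/// Reader side: roughly what a resolver for `platformHealth` would do.
pub fn read_snapshot(state: &SharedHealthState) -> PlatformHealthOverview {
    state.read().expect("health state lock poisoned").clone()
}
```

Building the snapshot before taking the write lock keeps readers blocked only for the duration of the assignment, which matters when the dashboard polls while a sweep is in progress.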
Diagnostic Matrix
The dual-layer design produces four diagnostic states that precisely isolate failure modes:
| Direct | Federated | Diagnosis | Action |
|---|---|---|---|
| UP | UP | Fully operational | No action needed |
| UP | DOWN | Process alive, federation issue | Check `logs/main.log` for schema composition errors. The Gateway likely failed to compose this subgraph's schema. Run `make restart-gateway` to re-compose. |
| DOWN | UP | Stale gateway cache | The process crashed but the Gateway is still serving cached responses. The cache will expire soon; restart the service. |
| DOWN | DOWN | Service down | Check `logs/{service}.log` for crash output. Restart with `make restart-{service}` or `make services-force`. |
TIP
The UP/DOWN state (direct alive, federated failing) is the most actionable diagnostic: it catches schema composition failures that `make smoke-test` alone would miss, since the CLI test only probes through the Gateway.
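The matrix above is a total function of the two probe results, which a `match` expresses directly. The sketch below is illustrative only: the enum and function names are hypothetical, and the action strings are condensed from the table.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ProbeStatus {
    Up,
    Down,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Diagnosis {
    FullyOperational,
    FederationIssue,
    StaleGatewayCache,
    ServiceDown,
}

/// Map the (direct, federated) pair onto the four states of the matrix.
pub fn diagnose(direct: ProbeStatus, federated: ProbeStatus) -> Diagnosis {
    use ProbeStatus::{Down, Up};
    match (direct, federated) {
        (Up, Up) => Diagnosis::FullyOperational,
        (Up, Down) => Diagnosis::FederationIssue,
        (Down, Up) => Diagnosis::StaleGatewayCache,
        (Down, Down) => Diagnosis::ServiceDown,
    }
}

/// Condensed version of the "Action" column.
pub fn suggested_action(diagnosis: Diagnosis) -> &'static str {
    match diagnosis {
        Diagnosis::FullyOperational => "No action needed",
        Diagnosis::FederationIssue => {
            "Check logs/main.log for composition errors, then run make restart-gateway"
        }
        Diagnosis::StaleGatewayCache => "Restart the service",
        Diagnosis::ServiceDown => {
            "Check logs/{service}.log, then run make restart-{service} or make services-force"
        }
    }
}
```

Because the match is exhaustive, adding a third probe status (for example a degraded state) would fail to compile until every combination is given a diagnosis.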
Smoke Query Mapping
Each subgraph uses a service-specific GraphQL query that mirrors the smoke test script. This ensures parity between CLI-based testing and the live dashboard:
| Service | Direct Query | Notes |
|---|---|---|
| malets | `{ maletsTest }` | Dedicated health-check resolver |
| products | `{ productsTest }` | Dedicated health-check resolver |
| services | `{ servicesTest }` | Dedicated health-check resolver |
| ucart | `{ ucartTest }` | Dedicated health-check resolver |
| media | `{ mediaHealthCheck }` | Dedicated health-check resolver |
| blogs | `{ blogsTest }` | Dedicated health-check resolver |
| community | `{ communityTest }` | Dedicated health-check resolver |
| organizations | `{ organizationsTest }` | Dedicated health-check resolver |
| nodes | `{ nodesTest }` | Dedicated health-check resolver |
| murchases | `order(id:"__tower__")` | Auth-error = reachable |
| payments | `userSubscription(...)` | Auth-error = reachable |
| search | `searchDocs(query:"__tower__")` | Uses unified search index |
| alerts | `alertLog(id:"__tower__")` | Auth-error = reachable |
| experiences | `{ myBookings { id } }` | Auth-error = reachable |
| auth | `{ mySession { ... } }` | Auth-error = reachable |
| uchat | `{ myConversations(...) }` | Auth-error = reachable. uChat uses Matrix/Conduit underneath |
| intelligence | `{ forYouFeed(...) }` | Rust service; same tech stack as Tower |
| tower | `{ __typename }` | Self-check via introspection |
| gateway | `{ __schema { queryType { name } } }` | Hive Gateway introspection |
| imaging, video, crypto, scim | `{ __typename }` | Ancillary services not in the Gateway supergraph; direct probe only |
Auth-Error Acceptance
For queries requiring authentication (murchases, payments, alerts, experiences, auth, uchat), the health monitor treats responses containing "Authentication required" or "Forbidden" as UP: the presence of an auth error proves the process is alive and the GraphQL schema is composed.
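A minimal sketch of that acceptance rule, combined with the `200 = UP` and `timeout/error = DOWN` rules from the architecture diagram. The `ProbeOutcome` type and `is_up` function are hypothetical names, and the sketch assumes a plain substring match on the response body; the real monitor may inspect the GraphQL `errors` array instead.

```rust
/// Hypothetical result of a single HTTP probe.
pub enum ProbeOutcome<'a> {
    /// The service answered; `body` is the raw response text.
    Response { status: u16, body: &'a str },
    /// Timeout or transport error: nothing came back within the 8s budget.
    Failed,
}

/// Substrings that count as proof of life, taken from the text above.
const AUTH_MARKERS: [&str; 2] = ["Authentication required", "Forbidden"];

/// Returns true when the probe should be recorded as UP.
pub fn is_up(outcome: &ProbeOutcome) -> bool {
    match outcome {
        ProbeOutcome::Response { status, body } => {
            // A 200 is UP outright; otherwise an auth error in the body
            // still shows that the resolver was reached.
            *status == 200 || AUTH_MARKERS.iter().any(|marker| body.contains(*marker))
        }
        ProbeOutcome::Failed => false,
    }
}
```

Note that under this rule a non-200 response without an auth marker (for example a `502` from a proxy) is treated as DOWN, which is an assumption of the sketch rather than documented behavior.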
Service Registry
All 34 monitored components are organized into five categories:
| Category | Services |
|---|---|
| Subgraphs (23) | malets, products, services, ucart, media, blogs, community, organizations, nodes, murchases, payments, search, alerts, experiences, imaging, video, crypto, scim, auth, uchat, intelligence, gateway, tower |
| Infrastructure (3) | PostgreSQL, Redis, MongoDB |
| Observability (5) | Meilisearch, MinIO, GlitchTip, Prometheus, Grafana |
| Frontends (1) | Frontend (SvelteKit) |
| Portals (2) | Dev Portal, Support Center |
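The registry table can be mirrored as static data, which also makes the 34-component total easy to check. This is an illustrative sketch: the `Category` enum, the `REGISTRY` constant, and the lookup helpers are hypothetical, and the names are copied from the table rather than from Tower's source.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Category {
    Subgraph,
    Infrastructure,
    Observability,
    Frontend,
    Portal,
}

/// One entry per category, listing the component names from the table.
pub const REGISTRY: &[(Category, &[&str])] = &[
    (
        Category::Subgraph,
        &[
            "malets", "products", "services", "ucart", "media", "blogs",
            "community", "organizations", "nodes", "murchases", "payments",
            "search", "alerts", "experiences", "imaging", "video", "crypto",
            "scim", "auth", "uchat", "intelligence", "gateway", "tower",
        ],
    ),
    (Category::Infrastructure, &["PostgreSQL", "Redis", "MongoDB"]),
    (
        Category::Observability,
        &["Meilisearch", "MinIO", "GlitchTip", "Prometheus", "Grafana"],
    ),
    (Category::Frontend, &["Frontend (SvelteKit)"]),
    (Category::Portal, &["Dev Portal", "Support Center"]),
];

/// Total number of monitored components across all categories.
pub fn total_services() -> usize {
    REGISTRY.iter().map(|(_, names)| names.len()).sum()
}

/// Look up which category a component belongs to, if it is registered.
pub fn category_of(name: &str) -> Option<Category> {
    REGISTRY
        .iter()
        .find(|(_, names)| names.contains(&name))
        .map(|(category, _)| *category)
}
```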
Configuration
| Variable | Default | Description |
|---|---|---|
| `HEALTH_CHECK_INTERVAL` | `30` | Probe interval in seconds |
| `TOWER_TARGET_ENV` | `local` | `local` for host-based probing (localhost ports), `docker` for container DNS |
IMPORTANT
TOWER_TARGET_ENV=local is required for development. In Docker/staging, set to docker so services resolve via container names instead of localhost.
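Reading these two variables with their documented defaults might look like the sketch below. The `HealthConfig` type and its methods are hypothetical, and two behaviors are assumptions rather than documented facts: unparsable or zero intervals fall back to 30 seconds, and in `docker` mode the container DNS name is taken to be the service name itself.

```rust
use std::time::Duration;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum TargetEnv {
    Local,
    Docker,
}

#[derive(Debug, Clone, PartialEq, Eq)]
pub struct HealthConfig {
    pub interval: Duration,
    pub target_env: TargetEnv,
}

impl HealthConfig {
    /// Build from raw variable values (None = unset). Kept separate from
    /// `from_env` so the parsing can be tested without touching the process env.
    pub fn from_vars(interval: Option<&str>, target_env: Option<&str>) -> Self {
        let secs = interval
            .and_then(|raw| raw.trim().parse::<u64>().ok())
            .filter(|secs| *secs > 0)
            .unwrap_or(30); // documented default
        let target_env = match target_env.map(str::trim) {
            Some("docker") => TargetEnv::Docker,
            _ => TargetEnv::Local, // documented default
        };
        Self { interval: Duration::from_secs(secs), target_env }
    }

    /// Read HEALTH_CHECK_INTERVAL and TOWER_TARGET_ENV from the environment.
    pub fn from_env() -> Self {
        Self::from_vars(
            std::env::var("HEALTH_CHECK_INTERVAL").ok().as_deref(),
            std::env::var("TOWER_TARGET_ENV").ok().as_deref(),
        )
    }

    /// Host to probe for a service: localhost in local mode, otherwise the
    /// container name (assumed here to equal the service name).
    pub fn host_for<'a>(&self, service: &'a str) -> &'a str {
        match self.target_env {
            TargetEnv::Local => "localhost",
            TargetEnv::Docker => service,
        }
    }
}
```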
Frontend Dashboard
The PlatformHealthWidget.svelte component renders the health data as a premium admin dashboard:
Layout
- Header: Shows cluster uptime percentage and aggregate KPI cards (Monitored / Healthy / Down)
- Grouped Grid: Services organized by category with collapsible sections and an `{n}/{total} online` counter
- Dual-Status Badges: Each subgraph card displays `D` (Direct) and `F` (Federated) micro-badges, color-coded green/red
- Service Detail Modal: Click any card to see probe details, response times, timestamps, diagnostic message, and a link to the Developer Portal subgraph overview doc
Theming
All colors use the Mallnline design token system (`$mall/tokens.css`), such as `var(--mall-text)`, `var(--mall-surface)`, and `var(--mall-text-secondary)`, ensuring consistent rendering in both light and dark modes.
GraphQL Queries
query GetPlatformHealth {
  platformHealth {
    totalServices
    healthyCount
    unhealthyCount
    degradedCount
    uptimePercentage
    checkedAt
    services {
      name
      category
      status
      directStatus
      federatedStatus
      responseTimeMs
      federatedResponseTimeMs
      lastCheckedAt
      lastSuccessAt
      consecutiveFailures
      target
    }
  }
}
Gateway Metrics (Detail Modal)
When viewing the gateway service, additional metrics are fetched:
query GetGatewayMetrics {
  gatewayMetrics {
    totalRequests
    totalErrors
    cacheHitRatio
    apqHitRatio
    uptimeFormatted
  }
}
These metrics are sourced from the Gateway's `GET /admin/health` endpoint and its Prometheus scrape endpoint.
Related
- Tower Subgraph: Tier 1 overview of the Rust/Axum subgraph powering health data and error tracking
- Tower DevOps Monitoring: TODO aggregation, git activity feed, and coverage heatmap widgets
- Workspaces & The Tower: Tab architecture, access control, and analytics pipeline for The Tower admin dashboard
- Error Tracking (GlitchTip): Self-hosted crash reporting infrastructure integrated alongside the health monitor
- Gateway Tracing & Observability: Prometheus metrics, response caching, and APQ that feed the gateway detail modal
- Monitoring & Alerting Infrastructure: Prometheus + Grafana time-series stack and the broader observability architecture
- Local Development Environment: Smart restart, smoke tests, and service orchestration that parallel the health monitor
- Platform Health Monitoring: User-facing guide for reading the health dashboard