
Tower Health Observability

The Tower Health Observability system is a background Rust task inside the Tower subgraph that continuously probes all 34 platform services using a dual-layer architecture: direct probes check process liveness, and federated probes check Gateway schema composition. The results are exposed as federated GraphQL and rendered in the Tower admin dashboard as a real-time health grid.

Design Principle: The dual-layer approach was inspired by the existing make smoke-test infrastructure (scripts/test/smoke-test.sh). Where the CLI tests each subgraph through the Gateway and accepts Authentication required as a "reachable" signal, Tower does the same and adds a direct probe to the process port for liveness.


Architecture

┌───────────────────────────────────────────────────────────────────┐
│                       Tower Health Monitor                        │
│                  (tokio::spawn background task)                   │
│                                                                   │
│  For each registered service:                                     │
│                                                                   │
│  ┌──── Direct Probe ──────┐   ┌──── Federated Probe ───────┐      │
│  │ POST {port}/graphql    │   │ POST gateway:30000/graphql │      │
│  │ Body: smoke query      │   │ Body: smoke query          │      │
│  │ Timeout: 8s            │   │ (routed through Gateway)   │      │
│  │ ✓ 200 = UP             │   │ Timeout: 8s                │      │
│  │ ✓ "Auth required" = UP │   │ ✓ 200 = UP                 │      │
│  │ ✗ timeout/error = DOWN │   │ ✓ "Auth required" = UP     │      │
│  └────────────────────────┘   │ ✗ timeout/error = DOWN     │      │
│                               └────────────────────────────┘      │
│                                                                   │
│  Result → SharedHealthState (Arc<RwLock<PlatformHealthOverview>>) │
└───────────────────────────────────────────────────────────────────┘
          │
          ▼ Exposed via federated GraphQL
┌──────────────────────┐     ┌───────────────────────────┐
│  platformHealth      │     │  PlatformHealthWidget     │
│  (GraphQL Query)     │────▶│  (Svelte 5 component)     │
│  via Hive Gateway    │     │  Grouped grid + modal     │
└──────────────────────┘     └───────────────────────────┘
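
The shared-state handoff in the diagram can be sketched as below. This is a minimal illustration, not Tower's actual code: it uses std::sync::RwLock so the block has no external dependencies (a tokio::spawn task might well use an async-aware lock instead), the three fields on PlatformHealthOverview are a hypothetical subset, and the helper names publish and snapshot are invented for the example.

```rust
use std::sync::{Arc, RwLock};

/// Hypothetical subset of the overview; the real struct likely carries more fields.
#[derive(Clone, Debug, Default, PartialEq)]
pub struct PlatformHealthOverview {
    pub total_services: u32,
    pub healthy_count: u32,
    pub unhealthy_count: u32,
}

/// Same shape as the type named in the diagram (std lock used here as an assumption).
pub type SharedHealthState = Arc<RwLock<PlatformHealthOverview>>;

/// Writer side: the probe loop swaps in a fresh overview after each cycle.
pub fn publish(state: &SharedHealthState, fresh: PlatformHealthOverview) {
    // Hold the write lock only for the assignment so readers are rarely blocked.
    let mut guard = state.write().expect("health state lock poisoned");
    *guard = fresh;
}

/// Reader side: a resolver clones a snapshot and releases the lock immediately.
pub fn snapshot(state: &SharedHealthState) -> PlatformHealthOverview {
    state.read().expect("health state lock poisoned").clone()
}
```

Cloning a snapshot keeps the read lock short-lived, so a slow GraphQL client never delays the next probe cycle's write.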

Diagnostic Matrix

The dual-layer design produces four diagnostic states that precisely isolate failure modes:

| Direct | Federated | Diagnosis | Action |
|---|---|---|---|
| ✅ UP | ✅ UP | Fully operational | No action needed |
| ✅ UP | ❌ DOWN | Process alive, federation issue | Check logs/main.log for schema composition errors. The Gateway likely failed to compose this subgraph's schema. Run make restart-gateway to re-compose. |
| ❌ DOWN | ✅ UP | Stale gateway cache | The process crashed but the Gateway is serving cached responses. The cache will expire soon; restart the service. |
| ❌ DOWN | ❌ DOWN | Service down | Check logs/{service}.log for crash output. Restart with make restart-{service} or make services-force. |

TIP

The UP/DOWN state (direct alive, federated failing) is the most actionable diagnostic: it isolates schema composition failures that make smoke-test alone cannot tell apart from a crashed process, since the CLI test only probes through the Gateway.
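
The matrix amounts to a two-input lookup, which could be expressed as the sketch below. The type and function names (ProbeStatus, Diagnosis, diagnose, action) are hypothetical and chosen for illustration; the action strings are abbreviated from the matrix.

```rust
/// Outcome of a single probe layer.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum ProbeStatus {
    Up,
    Down,
}

/// The four states of the diagnostic matrix.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Diagnosis {
    FullyOperational,
    FederationIssue,
    StaleGatewayCache,
    ServiceDown,
}

/// Maps a (direct, federated) pair of probe results to its matrix row.
pub fn diagnose(direct: ProbeStatus, federated: ProbeStatus) -> Diagnosis {
    match (direct, federated) {
        (ProbeStatus::Up, ProbeStatus::Up) => Diagnosis::FullyOperational,
        (ProbeStatus::Up, ProbeStatus::Down) => Diagnosis::FederationIssue,
        (ProbeStatus::Down, ProbeStatus::Up) => Diagnosis::StaleGatewayCache,
        (ProbeStatus::Down, ProbeStatus::Down) => Diagnosis::ServiceDown,
    }
}

/// Suggested operator action, abbreviated from the matrix.
pub fn action(diagnosis: Diagnosis) -> &'static str {
    match diagnosis {
        Diagnosis::FullyOperational => "No action needed",
        Diagnosis::FederationIssue => "Check logs/main.log, then run make restart-gateway",
        Diagnosis::StaleGatewayCache => "Restart the service",
        Diagnosis::ServiceDown => "Check logs/{service}.log, then run make restart-{service}",
    }
}
```

Because the match is exhaustive over both inputs, adding a third probe status (for example a degraded state) would fail to compile until every combination is given a diagnosis.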


Smoke Query Mapping

Each subgraph uses a service-specific GraphQL query that mirrors the smoke test script. This ensures parity between CLI-based testing and the live dashboard:

| Service | Direct Query | Notes |
|---|---|---|
| malets | `{ maletsTest }` | Dedicated health-check resolver |
| products | `{ productsTest }` | Dedicated health-check resolver |
| services | `{ servicesTest }` | Dedicated health-check resolver |
| ucart | `{ ucartTest }` | Dedicated health-check resolver |
| media | `{ mediaHealthCheck }` | Dedicated health-check resolver |
| blogs | `{ blogsTest }` | Dedicated health-check resolver |
| community | `{ communityTest }` | Dedicated health-check resolver |
| organizations | `{ organizationsTest }` | Dedicated health-check resolver |
| nodes | `{ nodesTest }` | Dedicated health-check resolver |
| murchases | `order(id:"__tower__")` | Auth-error = reachable |
| payments | `userSubscription(...)` | Auth-error = reachable |
| search | `searchDocs(query:"__tower__")` | Uses unified search index |
| alerts | `alertLog(id:"__tower__")` | Auth-error = reachable |
| experiences | `{ myBookings { id } }` | Auth-error = reachable |
| auth | `{ mySession { ... } }` | Auth-error = reachable |
| uchat | `{ myConversations(...) }` | Auth-error = reachable. uChat uses Matrix/Conduit underneath |
| intelligence | `{ forYouFeed(...) }` | Rust service, same tech stack as Tower |
| tower | `{ __typename }` | Self-check via introspection |
| gateway | `{ __schema { queryType { name } } }` | Hive Gateway introspection |
| imaging, video, crypto, scim | `{ __typename }` | Ancillary services not in the Gateway supergraph; direct probe only |

Auth-Error Acceptance

For queries requiring authentication (murchases, payments, alerts, experiences, auth, uchat), the health monitor treats responses containing "Authentication required" or "Forbidden" as UP: the presence of an auth error proves the process is alive and the GraphQL schema is composed.
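
A rough sketch of this acceptance rule follows. It assumes the probe result has already been reduced to an HTTP status plus body text; ProbeOutcome, AUTH_MARKERS, and is_up are hypothetical names. For brevity it applies the auth markers to every service, whereas the monitor may restrict them to the authenticated queries listed above.

```rust
/// Result of one probe attempt, before classification.
pub enum ProbeOutcome<'a> {
    /// The request completed with an HTTP status and a response body.
    Response { status: u16, body: &'a str },
    /// The request timed out or failed at the transport level.
    Failed,
}

/// Error fragments accepted as proof of reachability.
const AUTH_MARKERS: [&str; 2] = ["Authentication required", "Forbidden"];

/// Returns true when the probe should be reported as UP.
pub fn is_up(outcome: &ProbeOutcome<'_>) -> bool {
    match outcome {
        // A timeout or connection error is always DOWN.
        ProbeOutcome::Failed => false,
        // A 200 is UP; otherwise an auth error in the body still counts as UP.
        ProbeOutcome::Response { status, body } => {
            *status == 200 || AUTH_MARKERS.iter().any(|marker| body.contains(*marker))
        }
    }
}
```

The same classifier can serve both layers, since the diagram lists identical UP/DOWN rules for the direct and federated probes.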


Service Registry

All 34 monitored components are organized into five categories:

| Category | Services |
|---|---|
| Subgraphs (23) | malets, products, services, ucart, media, blogs, community, organizations, nodes, murchases, payments, search, alerts, experiences, imaging, video, crypto, scim, auth, uchat, intelligence, gateway, tower |
| Infrastructure (3) | PostgreSQL, Redis, MongoDB |
| Observability (5) | Meilisearch, MinIO, GlitchTip, Prometheus, Grafana |
| Frontends (1) | Frontend (SvelteKit) |
| Portals (2) | Dev Portal, Support Center |
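
As a sanity check on the registry, the category counts can be modelled as data, as in the sketch below. All names are hypothetical. The has_federated_probe helper encodes two things: the four ancillary subgraphs get a direct probe only (stated in the smoke query mapping), and non-subgraph components have no federated probe (an assumption, since they are not GraphQL subgraphs).

```rust
/// The five registry categories.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Category {
    Subgraphs,
    Infrastructure,
    Observability,
    Frontends,
    Portals,
}

/// Component counts per category, copied from the registry table.
pub const CATEGORY_COUNTS: [(Category, usize); 5] = [
    (Category::Subgraphs, 23),
    (Category::Infrastructure, 3),
    (Category::Observability, 5),
    (Category::Frontends, 1),
    (Category::Portals, 2),
];

/// Ancillary subgraphs outside the Gateway supergraph.
pub const DIRECT_ONLY: [&str; 4] = ["imaging", "video", "crypto", "scim"];

/// Sums the per-category counts.
pub fn total_components() -> usize {
    CATEGORY_COUNTS.iter().map(|(_, count)| count).sum()
}

/// Assumption: only subgraphs inside the supergraph get a federated probe.
pub fn has_federated_probe(category: Category, name: &str) -> bool {
    category == Category::Subgraphs && !DIRECT_ONLY.contains(&name)
}
```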

Configuration

| Variable | Default | Description |
|---|---|---|
| HEALTH_CHECK_INTERVAL | 30 | Probe interval in seconds |
| TOWER_TARGET_ENV | local | local for host-based probing (localhost ports), docker for container DNS |

IMPORTANT

TOWER_TARGET_ENV=local is required for development. In Docker/staging, set it to docker so services resolve via container names instead of localhost.
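
One way these two variables might be interpreted is sketched below. The function names are hypothetical. Falling back to the default on an unparseable or zero interval, treating any value other than docker as local, and using the http scheme are all assumptions rather than documented behavior.

```rust
use std::time::Duration;

/// Where probes are aimed, per TOWER_TARGET_ENV.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum TargetEnv {
    Local,
    Docker,
}

/// Parses TOWER_TARGET_ENV; anything other than "docker" falls back to local.
pub fn parse_target_env(raw: Option<&str>) -> TargetEnv {
    match raw.map(str::trim) {
        Some("docker") => TargetEnv::Docker,
        _ => TargetEnv::Local,
    }
}

/// Parses HEALTH_CHECK_INTERVAL (seconds), defaulting to 30.
pub fn parse_interval(raw: Option<&str>) -> Duration {
    let secs = raw
        .and_then(|s| s.trim().parse::<u64>().ok())
        .filter(|&s| s > 0) // assumption: a zero interval is treated as unset
        .unwrap_or(30);
    Duration::from_secs(secs)
}

/// Builds a direct-probe URL: localhost in local mode, container DNS in docker mode.
pub fn direct_url(env: TargetEnv, container: &str, port: u16) -> String {
    let host = match env {
        TargetEnv::Local => "localhost",
        TargetEnv::Docker => container,
    };
    format!("http://{host}:{port}/graphql")
}
```

At startup the raw values could be read with, for example, parse_interval(std::env::var("HEALTH_CHECK_INTERVAL").ok().as_deref()). Keeping the parsers free of direct environment access makes them easy to test.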


Frontend Dashboard

The PlatformHealthWidget.svelte component renders the health data in the Tower admin dashboard:

Layout

  • Header: Shows cluster uptime percentage and aggregate KPI cards (Monitored / Healthy / Down)
  • Grouped Grid: Services organized by category with collapsible sections and an {n}/{total} online counter
  • Dual-Status Badges: Each subgraph card displays D (Direct) and F (Federated) micro-badges, color-coded green/red
  • Service Detail Modal: Click any card to see probe details, response times, timestamps, diagnostic message, and a link to the Developer Portal subgraph overview doc

Theming

All colors come from the Mallnline design token system ($mall/tokens.css): var(--mall-text), var(--mall-surface), var(--mall-text-secondary). This ensures consistent rendering in both light and dark modes.

GraphQL Queries

query GetPlatformHealth {
  platformHealth {
    totalServices
    healthyCount
    unhealthyCount
    degradedCount
    uptimePercentage
    checkedAt
    services {
      name
      category
      status
      directStatus
      federatedStatus
      responseTimeMs
      federatedResponseTimeMs
      lastCheckedAt
      lastSuccessAt
      consecutiveFailures
      target
    }
  }
}
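
The aggregate fields at the top of this query could plausibly be derived from the per-service list, as in the sketch below. The Status variants and the uptime formula (healthy services as a share of the total) are assumptions; this document shows neither the schema's status enum nor the actual calculation.

```rust
/// Hypothetical status values; the schema's actual enum is not shown here.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Status {
    Healthy,
    Degraded,
    Unhealthy,
}

/// Mirrors the aggregate fields of the platformHealth query.
#[derive(Debug, PartialEq)]
pub struct Summary {
    pub total_services: usize,
    pub healthy_count: usize,
    pub unhealthy_count: usize,
    pub degraded_count: usize,
    pub uptime_percentage: f64,
}

/// Counts services per status and derives an uptime percentage.
pub fn summarize(statuses: &[Status]) -> Summary {
    let count = |wanted: Status| statuses.iter().filter(|s| **s == wanted).count();
    let total = statuses.len();
    let healthy = count(Status::Healthy);
    // Assumed formula: share of healthy services; an empty list reports 0.0.
    let uptime = if total == 0 {
        0.0
    } else {
        healthy as f64 * 100.0 / total as f64
    };
    Summary {
        total_services: total,
        healthy_count: healthy,
        unhealthy_count: count(Status::Unhealthy),
        degraded_count: count(Status::Degraded),
        uptime_percentage: uptime,
    }
}
```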

Gateway Metrics (Detail Modal)

When viewing the gateway service, additional metrics are fetched:

query GetGatewayMetrics {
  gatewayMetrics {
    totalRequests
    totalErrors
    cacheHitRatio
    apqHitRatio
    uptimeFormatted
  }
}

These metrics are sourced from the Gateway's GET /admin/health endpoint and its Prometheus scrape endpoint.