Gateway Tracing & Observability — Developer Guide

Overview

The Ngwenya API Gateway provides four layers of observability that work together to monitor performance, reduce backend load, and optimize network throughput.

Layer	Purpose	Backend
Response Caching	Cache full GraphQL responses for identical queries	Redis-backed `KeyValueCache`
Automatic Persisted Queries (APQ)	Reduce request size — clients send a hash, not full query text	Redis-backed hash store
Tracing Plugin	Per-operation timing, error tracking, and classification	In-memory aggregation
Metrics Export	Admin visibility into operation performance	REST endpoints (JSON + Prometheus)

All four layers are implemented in the apps/ngwenya-gateway/src/tracing/ module and are independently toggleable via environment variables.

Architecture

The tracing module sits alongside the existing rate-limit/ and deprecation/ modules:

File	Purpose
`tracing.module.ts`	NestJS module wiring all providers
`tracing.constants.ts`	Env var keys, defaults, Redis key prefixes
`tracing.plugin.ts`	Apollo Server lifecycle hooks for timing/errors
`cache.config.ts`	Redis cache adapter + factory functions for response cache and APQ
`metrics.service.ts`	In-memory aggregation + Prometheus text export
`metrics.controller.ts`	REST admin endpoints (metrics, health)

Request Flow

graph TD
    A["Incoming GraphQL Request"] --> B{"APQ Enabled?"}
    B -->|Yes| C{"Hash in Redis?"}
    B -->|No| F["Parse full query"]
    C -->|Hit| D["Resolve query from hash"]
    C -->|Miss| E["Store hash → query in Redis"]
    D --> F
    E --> F
    F --> G{"Response Cache Enabled?"}
    G -->|Yes| H{"Cached result exists?"}
    G -->|No| K["Execute resolvers"]
    H -->|Hit ✅| I["Return cached response"]
    H -->|Miss| K
    K --> L["TracingPlugin records metrics"]
    L --> M{"Has @cacheControl hints?"}
    M -->|Yes| N["Store in response cache"]
    M -->|No| O["Skip caching"]
    N --> P["Return response"]
    O --> P
    I --> P

    style I fill:#22c55e,color:#fff
    style P fill:#22c55e,color:#fff
    style D fill:#3b82f6,color:#fff
    style N fill:#3b82f6,color:#fff

Response Caching

The gateway uses @graphql-yoga/plugin-response-cache with a Redis-backed store to cache complete GraphQL responses.

How It Works

A Visitor or Buyer sends a query (e.g., GetProducts).
The plugin checks if an identical response is already cached.
Cache hit → Returns immediately without executing any resolvers.
Cache miss → Resolvers execute, and the response is cached for future requests (if cache-eligible).
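
The hit/miss flow above can be sketched as follows. This is a minimal illustration rather than the plugin's internals: an in-memory Map stands in for Redis, and `cachedExecute`, `CacheEntry`, and the `execute` callback are hypothetical names introduced only for this example.

```typescript
// Hypothetical sketch: a Map stands in for the Redis-backed cache.
type CacheEntry = { body: string; expiresAt: number };

const responseStore = new Map<string, CacheEntry>();

function cachedExecute(
  key: string,
  maxAgeSeconds: number,
  execute: () => string,
  now: number = Date.now(),
): { body: string; hit: boolean } {
  const entry = responseStore.get(key);
  // Cache hit: return the stored body without running any resolvers.
  if (entry && entry.expiresAt > now) {
    return { body: entry.body, hit: true };
  }
  // Cache miss: run the resolvers, then store the result only if it is eligible.
  const body = execute();
  if (maxAgeSeconds > 0) {
    responseStore.set(key, { body, expiresAt: now + maxAgeSeconds * 1000 });
  }
  return { body, hit: false };
}
```

Passing `maxAgeSeconds = 0` mirrors the gateway's safe default: the response is returned but never stored.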

Cache Eligibility

Responses are only cached when all resolved fields have @cacheControl hints with maxAge > 0. The gateway's safe default is maxAge: 0 — meaning nothing is cached unless explicitly opted in at the subgraph schema level.

# In a subgraph schema — opt a type into response caching
type Product @cacheControl(maxAge: 300) {
	id: ID!
	name: String!
	price: Money!
}

# Per-field override — short cache for dynamic data
type Malet @cacheControl(maxAge: 60) {
	id: ID!
	name: String!
	activeVisitors: Int @cacheControl(maxAge: 10)
}
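
One way to read the eligibility rule is sketched below. It assumes the response's effective TTL is the smallest maxAge among the resolved fields, and that a field without a hint falls back to the default of 0, which makes the whole response uncacheable. `effectiveMaxAge` and `FieldHint` are hypothetical names, not part of the plugin's API.

```typescript
// Hypothetical type: one entry per resolved field in the response.
type FieldHint = { path: string; maxAge?: number };

// Returns the TTL in seconds, or 0 when the response must not be cached.
function effectiveMaxAge(hints: FieldHint[], defaultMaxAge = 0): number {
  if (hints.length === 0) return 0;
  let ttl = Infinity;
  for (const hint of hints) {
    const maxAge = hint.maxAge ?? defaultMaxAge;
    // A single field with maxAge <= 0 disables caching for the whole response.
    if (maxAge <= 0) return 0;
    ttl = Math.min(ttl, maxAge);
  }
  return ttl;
}
```

Under this reading, a query that selects `Malet.name` (60 seconds) and `Malet.activeVisitors` (10 seconds) would be cached for 10 seconds.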

Session-Aware Caching (PRIVATE Scope)

The plugin supports per-user caching for responses marked with scope: PRIVATE:

type Query {
	myUCart: UCart @cacheControl(maxAge: 30, scope: PRIVATE)
}

The gateway identifies sessions by extracting the authenticated user's ID from the GraphQL context. Anonymous Visitors share the PUBLIC cache; authenticated Buyers get their own PRIVATE cache partition.

Redis Key Format

rc:{query-hash}:{variables-hash}     # PUBLIC scope
rc:{query-hash}:{variables-hash}:{userId}  # PRIVATE scope

Multi-instance safe: Because the cache is Redis-backed, all gateway instances share the same cache. A cache entry written by instance A is immediately available to instance B.
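
The key format above could be produced by a helper like the one below. This is a sketch under stated assumptions: the source does not say which hash function or variable serialization the gateway uses, so SHA-256 over `JSON.stringify` output is a stand-in, and `responseCacheKey` is a hypothetical name.

```typescript
import { createHash } from "node:crypto";

// Assumption: SHA-256 hex digests; the gateway's actual hash function is not documented here.
const sha256Hex = (input: string): string =>
  createHash("sha256").update(input).digest("hex");

function responseCacheKey(
  query: string,
  variables: Record<string, unknown>,
  userId?: string,
): string {
  // Note: JSON.stringify is sensitive to property order, so callers would
  // need to pass variables in a stable order for keys to match.
  const base = `rc:${sha256Hex(query)}:${sha256Hex(JSON.stringify(variables))}`;
  // PRIVATE scope appends the user ID so each Buyer gets a separate partition.
  return userId ? `${base}:${userId}` : base;
}
```

Because the key depends only on the request and, for PRIVATE scope, the user ID, every gateway instance derives the same key for the same request.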

Automatic Persisted Queries (APQ)

APQ is a network optimization available in GraphQL Yoga through the @graphql-yoga/plugin-apq plugin. Instead of sending the full query text (which can be thousands of characters) on every request, clients send its SHA-256 hash (64 hexadecimal characters).

How It Works

Step 1: Client sends hash only
  → Gateway checks Redis for hash
  → Miss: returns "PersistedQueryNotFound"

Step 2: Client sends hash + full query text
  → Gateway stores hash→query mapping in Redis
  → Executes query normally

Step 3+: Client sends hash only
  → Gateway finds query in Redis
  → Executes query — no full text needed
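
The three steps above can be sketched as a single lookup function. This is an illustration of the handshake, not the plugin's code: a Map stands in for Redis, and `resolvePersistedQuery`, `ApqRequest`, and the `HashMismatch` error name are hypothetical.

```typescript
import { createHash } from "node:crypto";

type ApqRequest = { sha256Hash: string; query?: string };
type ApqResult =
  | { ok: true; query: string }
  | { ok: false; error: "PersistedQueryNotFound" | "HashMismatch" };

// A Map stands in for the Redis hash store (keys follow apq:{sha256-hash}).
const apqStore = new Map<string, string>();

function resolvePersistedQuery(req: ApqRequest): ApqResult {
  const key = `apq:${req.sha256Hash}`;
  if (req.query === undefined) {
    // Steps 1 and 3+: hash only. Either the query is known or the client must retry.
    const stored = apqStore.get(key);
    return stored !== undefined
      ? { ok: true, query: stored }
      : { ok: false, error: "PersistedQueryNotFound" };
  }
  // Step 2: hash plus full text. Verify the hash before registering the query.
  const actual = createHash("sha256").update(req.query).digest("hex");
  if (actual !== req.sha256Hash) return { ok: false, error: "HashMismatch" };
  apqStore.set(key, req.query);
  return { ok: true, query: req.query };
}
```

Verifying the hash in step 2 is a design choice worth keeping in any implementation: it stops a client from registering one query under another query's hash.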

Why It Matters

Benefit	Impact
Smaller requests	Hash (64 chars) vs full query (500–5000 chars)
HTTP GET eligible	Hashed queries fit in URLs, enabling CDN caching
Survives restarts	Redis-backed cache persists across gateway deploys
Minimal client effort	Apollo Client enables APQ with one config line

Frontend Configuration

If the frontend uses Apollo Client, enable APQ with the persisted query link:

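import { createPersistedQueryLink } from '@apollo/client/link/persisted-queries';
import { sha256 } from 'crypto-hash';

// Place this link before the terminating HTTP link, e.g. link.concat(httpLink).
const link = createPersistedQueryLink({
	sha256,
	useGETForHashedQueries: true // send hashed queries as GET requests, enabling CDN edge caching
});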

Redis Key Format

apq:{sha256-hash}

Tracing Plugin

The TracingPlugin hooks into GraphQL Yoga's request lifecycle to capture per-operation diagnostics:

What It Tracks

Metric	Description
Operation name	e.g., `GetProducts`, `CreateMurchase`, `anonymous`
Operation type	`query`, `mutation`, `subscription`
Wall-clock latency	Start-to-response time in milliseconds
Error count	Number of errors per operation
Min / Max / Avg latency	Aggregated over the operation's lifetime
P95 latency	95th percentile from sampled latencies
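
The aggregates in the table can be maintained with a small amount of state per operation. The sketch below is one possible approach, not the actual MetricsService: the sample cap `MAX_SAMPLES`, the nearest-rank percentile method, and the names `record` and `percentile` are assumptions made for illustration.

```typescript
type OperationStats = {
  count: number;
  errorCount: number;
  totalDurationMs: number;
  minDurationMs: number;
  maxDurationMs: number;
  latencies: number[];
};

// Assumption: keep only the most recent samples so memory stays bounded.
const MAX_SAMPLES = 1000;
const statsByOperation = new Map<string, OperationStats>();

function record(operationName: string, durationMs: number, errors = 0): void {
  let stats = statsByOperation.get(operationName);
  if (!stats) {
    stats = {
      count: 0,
      errorCount: 0,
      totalDurationMs: 0,
      minDurationMs: Infinity,
      maxDurationMs: 0,
      latencies: [],
    };
    statsByOperation.set(operationName, stats);
  }
  stats.count += 1;
  stats.errorCount += errors;
  stats.totalDurationMs += durationMs;
  stats.minDurationMs = Math.min(stats.minDurationMs, durationMs);
  stats.maxDurationMs = Math.max(stats.maxDurationMs, durationMs);
  stats.latencies.push(durationMs);
  if (stats.latencies.length > MAX_SAMPLES) stats.latencies.shift();
}

// Nearest-rank percentile over the retained samples; returns 0 for no samples.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) return 0;
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[Math.max(0, rank - 1)];
}
```

Average latency is derived on read as `totalDurationMs / count`, so it does not need to be stored separately.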

What It Skips

Introspection queries (IntrospectionQuery): not meaningful for metrics.
All operations when TRACING_ENABLED=false: the plugin returns empty lifecycle hooks, so it adds effectively no overhead.

Plugin Lifecycle

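// The plugin hooks into two Apollo lifecycle events.
// Shape only: hook bodies are omitted and the variable name is illustrative.
const tracingPlugin = {
  async requestDidStart() {
    return {
      // 1. didResolveOperation: filters introspection queries
      async didResolveOperation() {},
      // 2. willSendResponse: records timing + error data
      async willSendResponse() {},
    };
  },
};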

Metrics Export

The MetricsService provides in-memory aggregation, exposed through three REST metrics endpoints plus a health endpoint:

`GET /admin/metrics`

Returns operation-level metrics as JSON:

{
  "global": {
    "totalRequests": 1247,
    "totalErrors": 23,
    "totalCacheHits": 412,
    "totalCacheMisses": 835,
    "apqHits": 1100,
    "apqMisses": 147,
    "uptimeMs": 3600000
  },
  "operations": [
    {
      "operationName": "GetProducts",
      "operationType": "query",
      "count": 892,
      "totalDurationMs": 15764,
      "minDurationMs": 3,
      "maxDurationMs": 245,
      "errorCount": 2,
      "cacheHits": 380,
      "latencies": [...]
    }
  ],
  "operationCount": 15,
  "collectedAt": "2026-04-05T18:00:00.000Z"
}

`GET /admin/metrics/prometheus`

Returns metrics in the Prometheus text exposition format, which Prometheus and compatible scrapers can ingest and Grafana can then visualize:

# HELP gql_requests_total Total GraphQL requests
# TYPE gql_requests_total counter
gql_requests_total 1247

# HELP gql_errors_total Total GraphQL errors
# TYPE gql_errors_total counter
gql_errors_total 23

# HELP gql_cache_hits_total Response cache hits
# TYPE gql_cache_hits_total counter
gql_cache_hits_total 412

# HELP gql_apq_hits_total APQ cache hits
# TYPE gql_apq_hits_total counter
gql_apq_hits_total 1100

# HELP gql_operation_count Requests per operation
# TYPE gql_operation_count counter
gql_operation_count{operation="GetProducts",type="query"} 892
gql_operation_count{operation="CreateMurchase",type="mutation"} 156
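
Output in this shape can be produced with plain string building. The sketch below is a hypothetical renderer, not the gateway's MetricsService: `renderPrometheus`, `Counter`, and `OperationCount` are names introduced for illustration, and only the metric names shown above are taken from the source.

```typescript
type Counter = { name: string; help: string; value: number };
type OperationCount = { operation: string; type: string; count: number };

// Label values must escape backslash, double quote, and newline.
const escapeLabel = (value: string): string =>
  value.replace(/\\/g, "\\\\").replace(/"/g, '\\"').replace(/\n/g, "\\n");

function renderPrometheus(counters: Counter[], operations: OperationCount[]): string {
  const lines: string[] = [];
  for (const c of counters) {
    lines.push(`# HELP ${c.name} ${c.help}`, `# TYPE ${c.name} counter`, `${c.name} ${c.value}`, "");
  }
  lines.push(
    "# HELP gql_operation_count Requests per operation",
    "# TYPE gql_operation_count counter",
  );
  for (const op of operations) {
    const labels = `operation="${escapeLabel(op.operation)}",type="${escapeLabel(op.type)}"`;
    lines.push(`gql_operation_count{${labels}} ${op.count}`);
  }
  // The exposition format expects the payload to end with a newline.
  return lines.join("\n") + "\n";
}
```

Escaping label values matters here because operation names come from client-supplied queries.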

`POST /admin/metrics/reset`

Resets all counters to zero. Useful for testing or after deploying a new version:

curl -X POST http://localhost:3000/admin/metrics/reset
# { "message": "Metrics counters reset" }

Note: Metrics are stored in-memory and reset on gateway restart. This is intentional — the same pattern used by the deprecation tracking module. Redis-backed persistence can be added later if historical data is needed.

`GET /admin/health`

Returns process health and memory stats, used by the soak test harness for memory leak detection:

{
  "status": "ok",
  "pid": 12345,
  "uptime": 3600.5,
  "memory": {
    "rss": 256901120,
    "heapTotal": 174063616,
    "heapUsed": 169410560,
    "external": 8294400,
    "arrayBuffers": 4651008
  }
}

This endpoint is polled at regular intervals during soak tests to build a memory timeline and detect heap growth patterns.
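
A payload with these fields can be assembled from Node.js built-ins. The sketch below assumes that is roughly what the endpoint does; `healthSnapshot` is a hypothetical name, while `process.memoryUsage()`, `process.uptime()`, and `process.pid` are standard Node.js APIs.

```typescript
// Builds a health payload from the current process (sizes are in bytes, uptime in seconds).
function healthSnapshot() {
  const { rss, heapTotal, heapUsed, external, arrayBuffers } = process.memoryUsage();
  return {
    status: "ok",
    pid: process.pid,
    uptime: process.uptime(),
    memory: { rss, heapTotal, heapUsed, external, arrayBuffers },
  };
}
```

For leak detection, `heapUsed` is usually the most informative field to track over time, since `rss` also includes memory outside the JavaScript heap.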

Environment Variables

All tracing features are independently toggleable with sensible defaults:

Variable	Default	Description
`TRACING_ENABLED`	`true`	Master toggle for the tracing plugin
`APQ_ENABLED`	`true`	Toggle Redis-backed APQ cache
`RESPONSE_CACHE_ENABLED`	`true`	Toggle Redis-backed response caching
`RESPONSE_CACHE_MAX_AGE`	`0`	Default PUBLIC max-age in seconds (0 = explicit hints only)
`METRICS_ENABLED`	`true`	Toggle metrics collection and admin endpoints

To disable response caching in a development environment:

RESPONSE_CACHE_ENABLED=false

To disable all tracing features at once:

TRACING_ENABLED=false
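
Reading these variables with their defaults might look like the sketch below. The variable names and defaults come from the table above; the accepted truthy spellings (`true`, `1`) and the names `flag` and `loadTracingConfig` are assumptions, not the contents of tracing.constants.ts.

```typescript
type Env = Record<string, string | undefined>;

// Assumption: "true" and "1" (case-insensitive) enable a flag; anything else disables it.
function flag(env: Env, key: string, fallback: boolean): boolean {
  const raw = env[key]?.trim().toLowerCase();
  if (raw === undefined || raw === "") return fallback;
  return raw === "true" || raw === "1";
}

function loadTracingConfig(env: Env) {
  const maxAge = Number.parseInt(env["RESPONSE_CACHE_MAX_AGE"] ?? "0", 10);
  return {
    tracingEnabled: flag(env, "TRACING_ENABLED", true),
    apqEnabled: flag(env, "APQ_ENABLED", true),
    responseCacheEnabled: flag(env, "RESPONSE_CACHE_ENABLED", true),
    // Invalid or negative values fall back to 0 (explicit hints only).
    responseCacheMaxAge: Number.isNaN(maxAge) || maxAge < 0 ? 0 : maxAge,
    metricsEnabled: flag(env, "METRICS_ENABLED", true),
  };
}
```

In the gateway this would typically be called once at startup as `loadTracingConfig(process.env)`.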

Monitoring with Grafana

Add the gateway as a scrape target in your Prometheus configuration:

scrape_configs:
  - job_name: 'ngwenya-gateway'
    metrics_path: '/admin/metrics/prometheus'
    static_configs:
      - targets: ['localhost:3000']

Read the Monitoring & Alerting Infrastructure guide to learn how to launch the pre-provisioned Grafana dashboard that visualizes these metrics.

Key panels to build:

Panel	Metric	Purpose
Request rate	`gql_requests_total`	Traffic volume
Error rate	`gql_errors_total / gql_requests_total`	Error percentage
Cache hit ratio	`gql_cache_hits_total / (gql_cache_hits_total + gql_cache_misses_total)`	Cache effectiveness
APQ hit ratio	`gql_apq_hits_total / (gql_apq_hits_total + gql_apq_misses_total)`	APQ adoption
Slowest operations	`gql_operation_duration_ms_avg`	Performance bottlenecks
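
The ratio panels can be sanity-checked against the `global` block returned by `GET /admin/metrics`. The sketch below uses the field names from that JSON sample; `panelRatios` and `ratio` are hypothetical helpers, and returning 0 for an empty denominator is an assumption.

```typescript
type GlobalMetrics = {
  totalRequests: number;
  totalErrors: number;
  totalCacheHits: number;
  totalCacheMisses: number;
  apqHits: number;
  apqMisses: number;
};

// Avoid division by zero right after a reset, when all counters are 0.
const ratio = (numerator: number, denominator: number): number =>
  denominator === 0 ? 0 : numerator / denominator;

function panelRatios(g: GlobalMetrics) {
  return {
    errorRate: ratio(g.totalErrors, g.totalRequests),
    cacheHitRatio: ratio(g.totalCacheHits, g.totalCacheHits + g.totalCacheMisses),
    apqHitRatio: ratio(g.apqHits, g.apqHits + g.apqMisses),
  };
}
```

With the sample values shown earlier (23 errors in 1247 requests, 412 cache hits against 835 misses, 1100 APQ hits against 147 misses), this gives an error rate of roughly 1.8%, a cache hit ratio of roughly 33%, and an APQ hit ratio of roughly 88%.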

Testing

Unit Tests

# Run all gateway unit tests (138+ tests, including tracing + ws-auth)
npm run test -- apps/ngwenya-gateway --forceExit

Covers: metrics aggregation, Prometheus export, health endpoint, tracing plugin lifecycle, Redis cache adapter, factory functions, controller endpoints.

E2E Tests

# Run gateway E2E tests (19 tests, including tracing admin endpoints)
npx jest --config apps/ngwenya-gateway/test/jest-e2e.json --forceExit

Covers: JSON metrics endpoint, Prometheus text endpoint, counter reset, operation accumulation.

Soak Test

The gateway includes a zero-dependency soak test harness for verifying memory stability under sustained load:

# Full 30-minute soak test
make soak-test

# Quick 5-minute validation
SOAK_DURATION_MINS=5 make soak-test

The soak test:

Resets gateway metrics and captures a baseline memory snapshot via /admin/health
Launches a Node.js load generator (soak-load.mjs) sending a realistic query mix at configurable RPS
Samples heap usage every 60 seconds via /admin/health
Produces a JSON report (logs/soak-report.json) and CSV memory timeline (logs/soak-memory.csv)
Reports a verdict: PASS if heap growth stays below the threshold (default 50%), FAIL otherwise

Variable	Default	Description
`SOAK_DURATION_MINS`	`30`	Test duration
`SOAK_CONCURRENCY`	`10`	Concurrent request workers
`SOAK_RPS`	`50`	Target requests per second
`SOAK_HEAP_THRESHOLD`	`50`	Max allowed heap growth %
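
The verdict comes down to a percentage comparison. The sketch below assumes growth is measured from the baseline snapshot to the final sample; the harness may instead use a peak or an average, and `soakVerdict` and `HeapSample` are hypothetical names.

```typescript
type HeapSample = { elapsedSec: number; heapUsed: number };

function soakVerdict(
  baselineHeapUsed: number,
  samples: HeapSample[],
  thresholdPct = 50, // mirrors the SOAK_HEAP_THRESHOLD default
): { growthPct: number; verdict: "PASS" | "FAIL" } {
  if (baselineHeapUsed <= 0 || samples.length === 0) {
    throw new Error("a positive baseline and at least one sample are required");
  }
  // Assumption: compare the last sample against the baseline.
  const finalHeapUsed = samples[samples.length - 1].heapUsed;
  const growthPct = ((finalHeapUsed - baselineHeapUsed) / baselineHeapUsed) * 100;
  return { growthPct, verdict: growthPct < thresholdPct ? "PASS" : "FAIL" };
}
```

For example, a baseline of 100 MB and a final sample of 120 MB is 20% growth, which passes at the default threshold, while a final sample of 160 MB is 60% growth and fails.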

Related

Gateway (Hive Gateway): federated gateway architecture, composition, benchmarks, and migration details
Gateway Rate Limiting: companion gateway feature covering dual-window throttling, mutation penalties, and auth uplift
Debugging & Testing: curl workflows for tracing auth and order pipelines
Alerts Resilience & Delivery Tracking: observability for the alerts pipeline (DLQ, retry, delivery logs)
Error Tracking (GlitchTip): client-side crash reporting, the frontend counterpart to gateway-level tracing
