Gateway Tracing & Observability: Developer Guide

Overview

The Ngwenya API Gateway provides four layers of observability that work together to monitor performance, reduce backend load, and optimize network throughput.

| Layer | Purpose | Backend |
| --- | --- | --- |
| Response Caching | Cache full GraphQL responses for identical queries | Redis-backed KeyValueCache |
| Automatic Persisted Queries (APQ) | Reduce request size: clients send a hash, not full query text | Redis-backed hash store |
| Tracing Plugin | Per-operation timing, error tracking, and classification | In-memory aggregation |
| Metrics Export | Admin visibility into operation performance | REST endpoints (JSON + Prometheus) |

All four layers are implemented in the apps/ngwenya-gateway/src/tracing/ module and are independently toggleable via environment variables.


Architecture

The tracing module sits alongside the existing rate-limit/ and deprecation/ modules:

| File | Purpose |
| --- | --- |
| tracing.module.ts | NestJS module wiring all providers |
| tracing.constants.ts | Env var keys, defaults, Redis key prefixes |
| tracing.plugin.ts | Apollo Server lifecycle hooks for timing/errors |
| cache.config.ts | Redis cache adapter + factory functions for response cache and APQ |
| metrics.service.ts | In-memory aggregation + Prometheus text export |
| metrics.controller.ts | REST admin endpoints (metrics, health) |

Request Flow

graph TD
    A["Incoming GraphQL Request"] --> B{"APQ Enabled?"}
    B -->|Yes| C{"Hash in Redis?"}
    B -->|No| F["Parse full query"]
    C -->|Hit| D["Resolve query from hash"]
    C -->|Miss| E["Store hash → query in Redis"]
    D --> F
    E --> F
    F --> G{"Response Cache Enabled?"}
    G -->|Yes| H{"Cached result exists?"}
    G -->|No| K["Execute resolvers"]
    H -->|Hit ✅| I["Return cached response"]
    H -->|Miss| K
    K --> L["TracingPlugin records metrics"]
    L --> M{"Has @cacheControl hints?"}
    M -->|Yes| N["Store in response cache"]
    M -->|No| O["Skip caching"]
    N --> P["Return response"]
    O --> P
    I --> P

    style I fill:#22c55e,color:#fff
    style P fill:#22c55e,color:#fff
    style D fill:#3b82f6,color:#fff
    style N fill:#3b82f6,color:#fff

Response Caching

The gateway uses @graphql-yoga/plugin-response-cache with a Redis-backed cache to cache complete GraphQL responses.

How It Works

  1. A Visitor or Buyer sends a query (e.g., GetProducts).
  2. The plugin checks if an identical response is already cached.
  3. Cache hit → Returns immediately without executing any resolvers.
  4. Cache miss → Resolvers execute, and the response is cached for future requests (if cache-eligible).

Cache Eligibility

Responses are only cached when all resolved fields have @cacheControl hints with maxAge > 0. The gateway's safe default is maxAge: 0, meaning nothing is cached unless explicitly opted in at the subgraph schema level.

# In a subgraph schema: opt a type into response caching
type Product @cacheControl(maxAge: 300) {
	id: ID!
	name: String!
	price: Money!
}

# Per-field override: short cache for dynamic data
type Malet @cacheControl(maxAge: 60) {
	id: ID!
	name: String!
	activeVisitors: Int @cacheControl(maxAge: 10)
}
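As a rough illustration of the eligibility rule, the sketch below collapses per-field hints into one response-level policy: the smallest maxAge wins, and any PRIVATE hint makes the whole response PRIVATE. The `CacheHint` type and `overallPolicy` function are hypothetical names for this example, not the gateway's code. The example assumes a field without a hint falls back to the default maxAge of 0.

```typescript
type Scope = "PUBLIC" | "PRIVATE";

// Hypothetical shape of one resolved field's hint; a missing maxAge means "no hint".
interface CacheHint {
  maxAge?: number;
  scope?: Scope;
}

interface OverallPolicy {
  maxAge: number;
  scope: Scope;
  cacheable: boolean;
}

// Collapse field-level hints into a single response-level policy.
function overallPolicy(hints: CacheHint[], defaultMaxAge = 0): OverallPolicy {
  let maxAge = Number.POSITIVE_INFINITY;
  let scope: Scope = "PUBLIC";
  for (const hint of hints) {
    // The most restrictive (smallest) maxAge applies to the whole response.
    maxAge = Math.min(maxAge, hint.maxAge ?? defaultMaxAge);
    if (hint.scope === "PRIVATE") scope = "PRIVATE";
  }
  if (!Number.isFinite(maxAge)) maxAge = defaultMaxAge; // no fields were resolved
  return { maxAge, scope, cacheable: maxAge > 0 };
}

// Malet (60s) with activeVisitors (10s): the response is cacheable for 10s.
const maletPolicy = overallPolicy([{ maxAge: 60 }, { maxAge: 10 }]);

// One field without a hint falls back to maxAge 0, so nothing is cached.
const mixedPolicy = overallPolicy([{ maxAge: 300 }, {}]);
```

Under these assumptions, a single unhinted field is enough to keep the whole response out of the cache, which is the behavior behind the "safe default".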

Session-Aware Caching (PRIVATE Scope)

The plugin supports per-user caching for responses marked with scope: PRIVATE:

type Query {
	myUCart: UCart @cacheControl(maxAge: 30, scope: PRIVATE)
}

The gateway identifies sessions by extracting the authenticated user's ID from the GraphQL context. Anonymous Visitors share the PUBLIC cache; authenticated Buyers get their own PRIVATE cache partition.

Redis Key Format

rc:{query-hash}:{variables-hash}     # PUBLIC scope
rc:{query-hash}:{variables-hash}:{userId}  # PRIVATE scope

Multi-instance safe: Because the cache is Redis-backed, all gateway instances share the same cache. A cache entry written by instance A is immediately available to instance B.
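The key format above can be sketched as a small helper. This is illustrative only: the hashing algorithm (SHA-256 over the raw query text and over `JSON.stringify` of the variables) and the function name `responseCacheKey` are assumptions, and the actual plugin may normalize the query or variables differently.

```typescript
import { createHash } from "node:crypto";

const sha256Hex = (input: string): string =>
  createHash("sha256").update(input).digest("hex");

// Builds rc:{query-hash}:{variables-hash} and appends :{userId} for PRIVATE scope.
function responseCacheKey(
  query: string,
  variables: Record<string, unknown>,
  userId?: string,
): string {
  // Assumption: variables are serialized as-is, so property order affects the hash.
  const base = `rc:${sha256Hex(query)}:${sha256Hex(JSON.stringify(variables))}`;
  return userId === undefined ? base : `${base}:${userId}`;
}

const publicKey = responseCacheKey("query GetProducts { products { id } }", {});
const privateKey = responseCacheKey("query MyCart { myUCart { id } }", {}, "user-42");
```

Because the user ID is part of the PRIVATE key, two Buyers running the same query with the same variables never read each other's entries.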


Automatic Persisted Queries (APQ)

APQ is a network optimization built into GraphQL Yoga via the @graphql-yoga/plugin-apq plugin. Instead of sending the full query text on every request, which can run to thousands of characters, clients send its SHA-256 hash (64 hexadecimal characters).

How It Works

Step 1: Client sends hash only
  → Gateway checks Redis for hash
  → Miss: returns "PersistedQueryNotFound"

Step 2: Client sends hash + full query text
  → Gateway stores hash→query mapping in Redis
  → Executes query normally

Step 3+: Client sends hash only
  → Gateway finds query in Redis
  → Executes query; no full text needed
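
The three steps can be modeled in a few lines. This is a minimal sketch rather than the plugin's implementation: a `Map` stands in for Redis, `resolveApq` and its types are hypothetical names, and the hash verification step (with its `HashMismatch` error) is an assumption about sensible server behavior.

```typescript
import { createHash } from "node:crypto";

interface ApqRequest {
  sha256Hash: string;
  query?: string; // present only when the client sends the full text (step 2)
}

type ApqResult =
  | { ok: true; query: string }
  | { ok: false; error: "PersistedQueryNotFound" | "HashMismatch" };

// `store` is an in-memory stand-in for the Redis-backed hash store.
function resolveApq(store: Map<string, string>, req: ApqRequest): ApqResult {
  const key = `apq:${req.sha256Hash}`;
  if (req.query !== undefined) {
    // Step 2: verify the hash before trusting the mapping, then register it.
    const actual = createHash("sha256").update(req.query).digest("hex");
    if (actual !== req.sha256Hash) return { ok: false, error: "HashMismatch" };
    store.set(key, req.query);
    return { ok: true, query: req.query };
  }
  // Steps 1 and 3+: hash only, so the query text must come from the store.
  const cached = store.get(key);
  return cached === undefined
    ? { ok: false, error: "PersistedQueryNotFound" }
    : { ok: true, query: cached };
}

const apqStore = new Map<string, string>();
const queryText = "query GetProducts { products { id name } }";
const queryHash = createHash("sha256").update(queryText).digest("hex");

const step1 = resolveApq(apqStore, { sha256Hash: queryHash });
const step2 = resolveApq(apqStore, { sha256Hash: queryHash, query: queryText });
const step3 = resolveApq(apqStore, { sha256Hash: queryHash });
```

Here `step1` fails with `PersistedQueryNotFound`, `step2` registers the mapping, and `step3` resolves the query from the hash alone.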

Why It Matters

| Benefit | Impact |
| --- | --- |
| Smaller requests | Hash (64 chars) vs full query (500 to 5,000 chars) |
| HTTP GET eligible | Hashed queries fit in URLs, enabling CDN caching |
| Survives restarts | Redis-backed cache persists across gateway deploys |
| Zero client effort | Apollo Client enables APQ with one config line |

Frontend Configuration

If the frontend uses Apollo Client, enable APQ with the persisted query link:

import { createPersistedQueryLink } from '@apollo/client/link/persisted-queries';
import { sha256 } from 'crypto-hash';

const link = createPersistedQueryLink({
	sha256,
	useGETForHashedQueries: true // enables CDN edge caching
});
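
To make the `useGETForHashedQueries` option concrete, the sketch below builds the kind of URL a hash-only GET request carries: the hash travels in an `extensions` parameter under `persistedQuery`, alongside the operation name and variables. The helper name `buildApqGetUrl` and the endpoint URL are hypothetical, and the hash value is a placeholder.

```typescript
// Builds a hash-only APQ GET URL. The endpoint is a hypothetical example value.
function buildApqGetUrl(
  endpoint: string,
  operationName: string,
  variables: Record<string, unknown>,
  sha256Hash: string,
): string {
  const extensions = { persistedQuery: { version: 1, sha256Hash } };
  const params: Array<[string, string]> = [
    ["operationName", operationName],
    ["variables", JSON.stringify(variables)],
    ["extensions", JSON.stringify(extensions)],
  ];
  const queryString = params
    .map(([key, value]) => `${key}=${encodeURIComponent(value)}`)
    .join("&");
  return `${endpoint}?${queryString}`;
}

const apqUrl = buildApqGetUrl(
  "https://gateway.example.com/graphql",
  "GetProducts",
  { first: 10 },
  "a".repeat(64), // placeholder for a real 64-character hex digest
);
```

Since the URL is identical for every client sending the same operation and variables, a CDN can treat it as an ordinary cacheable GET.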

Redis Key Format

apq:{sha256-hash}

Tracing Plugin

The TracingPlugin hooks into GraphQL Yoga's request lifecycle to capture per-operation diagnostics:

What It Tracks

| Metric | Description |
| --- | --- |
| Operation name | e.g., GetProducts, CreateMurchase, anonymous |
| Operation type | query, mutation, subscription |
| Wall-clock latency | Start-to-response time in milliseconds |
| Error count | Number of errors per operation |
| Min / Max / Avg latency | Aggregated over the operation's lifetime |
| P95 latency | 95th percentile from sampled latencies |
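
One way to maintain these aggregates is sketched below. The field names follow the JSON returned by the metrics endpoint; everything else is an assumption for illustration: the `record` and `percentile` helpers, the sample cap of 1000, and the nearest-rank percentile method are not confirmed details of MetricsService.

```typescript
interface OperationStats {
  count: number;
  totalDurationMs: number;
  minDurationMs: number;
  maxDurationMs: number;
  errorCount: number;
  latencies: number[]; // bounded sample window used for percentiles
}

const MAX_SAMPLES = 1000; // assumed cap so memory stays bounded

function record(
  stats: Map<string, OperationStats>,
  name: string,
  durationMs: number,
  errors: number,
): void {
  const s = stats.get(name) ?? {
    count: 0,
    totalDurationMs: 0,
    minDurationMs: Number.POSITIVE_INFINITY,
    maxDurationMs: 0,
    errorCount: 0,
    latencies: [],
  };
  s.count += 1;
  s.totalDurationMs += durationMs;
  s.minDurationMs = Math.min(s.minDurationMs, durationMs);
  s.maxDurationMs = Math.max(s.maxDurationMs, durationMs);
  s.errorCount += errors;
  s.latencies.push(durationMs);
  if (s.latencies.length > MAX_SAMPLES) s.latencies.shift(); // drop the oldest sample
  stats.set(name, s);
}

// Nearest-rank percentile over the sampled latencies.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) return 0;
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  const index = Math.min(Math.max(rank, 1), sorted.length) - 1;
  return sorted[index] ?? 0;
}

const avgDurationMs = (s: OperationStats): number =>
  s.count === 0 ? 0 : s.totalDurationMs / s.count;
```

Average latency is derived from `totalDurationMs / count` rather than stored, which keeps each update O(1); only the percentile needs the sample array.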

What It Skips

  • Introspection queries (IntrospectionQuery): not meaningful for metrics.
  • When disabled: returns empty lifecycle hooks with zero overhead.

Plugin Lifecycle

// The plugin hooks into two Apollo lifecycle events:

requestDidStart → {
  // 1. didResolveOperation: filters introspection queries
  // 2. willSendResponse: records timing + error data
}
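
The outline above can be expanded into a self-contained sketch. The hook names come from the outline; the `RequestContextLike` and `RequestListenerLike` types are simplified stand-ins for the server's real plugin interfaces, and `createTracingPlugin` and `Recorder` are hypothetical names, so treat this as a model of the flow rather than the actual tracing.plugin.ts.

```typescript
// Simplified stand-ins for the server's plugin types (assumption, not the real API surface).
interface RequestContextLike {
  operationName?: string | null;
  operation?: { operation: "query" | "mutation" | "subscription" };
  errors?: readonly unknown[];
}

interface RequestListenerLike {
  didResolveOperation?(ctx: RequestContextLike): Promise<void>;
  willSendResponse?(ctx: RequestContextLike): Promise<void>;
}

type Recorder = (
  name: string,
  type: string,
  durationMs: number,
  errorCount: number,
) => void;

function createTracingPlugin(record: Recorder, enabled = true) {
  return {
    async requestDidStart(): Promise<RequestListenerLike> {
      if (!enabled) return {}; // disabled: no hooks are registered
      const startedAt = Date.now();
      let skip = false;
      return {
        // 1. Flag introspection queries so they are left out of the metrics.
        async didResolveOperation(ctx: RequestContextLike): Promise<void> {
          skip = ctx.operationName === "IntrospectionQuery";
        },
        // 2. Record wall-clock latency and error count just before the response is sent.
        async willSendResponse(ctx: RequestContextLike): Promise<void> {
          if (skip) return;
          record(
            ctx.operationName ?? "anonymous",
            ctx.operation?.operation ?? "query",
            Date.now() - startedAt,
            ctx.errors?.length ?? 0,
          );
        },
      };
    },
  };
}
```

Keeping `startedAt` and `skip` in the per-request closure means no shared state is needed between concurrent requests.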

Metrics Export

The MetricsService provides in-memory aggregation, exposed through the REST admin endpoints below:

`GET /admin/metrics`

Returns operation-level metrics as JSON:

{
  "global": {
    "totalRequests": 1247,
    "totalErrors": 23,
    "totalCacheHits": 412,
    "totalCacheMisses": 835,
    "apqHits": 1100,
    "apqMisses": 147,
    "uptimeMs": 3600000
  },
  "operations": [
    {
      "operationName": "GetProducts",
      "operationType": "query",
      "count": 892,
      "totalDurationMs": 15764,
      "minDurationMs": 3,
      "maxDurationMs": 245,
      "errorCount": 2,
      "cacheHits": 380,
      "latencies": [...]
    }
  ],
  "operationCount": 15,
  "collectedAt": "2026-04-05T18:00:00.000Z"
}

`GET /admin/metrics/prometheus`

Returns metrics in Prometheus text exposition format, compatible with Prometheus, Grafana, and any OpenMetrics scraper:

# HELP gql_requests_total Total GraphQL requests
# TYPE gql_requests_total counter
gql_requests_total 1247

# HELP gql_errors_total Total GraphQL errors
# TYPE gql_errors_total counter
gql_errors_total 23

# HELP gql_cache_hits_total Response cache hits
# TYPE gql_cache_hits_total counter
gql_cache_hits_total 412

# HELP gql_apq_hits_total APQ cache hits
# TYPE gql_apq_hits_total counter
gql_apq_hits_total 1100

# HELP gql_operation_count Requests per operation
# TYPE gql_operation_count counter
gql_operation_count{operation="GetProducts",type="query"} 892
gql_operation_count{operation="CreateMurchase",type="mutation"} 156
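
Rendering this format takes little more than string concatenation. The sketch below is one possible approach, not the service's actual exporter: `renderPrometheus`, `Counter`, and `OperationCount` are hypothetical names, while the metric names and help strings are copied from the sample output above.

```typescript
interface Counter {
  name: string;
  help: string;
  value: number;
}

interface OperationCount {
  operation: string;
  type: string;
  count: number;
}

// Label values must escape backslashes, double quotes, and newlines.
const escapeLabel = (value: string): string =>
  value.replace(/\\/g, "\\\\").replace(/"/g, '\\"').replace(/\n/g, "\\n");

function renderPrometheus(counters: Counter[], ops: OperationCount[]): string {
  const lines: string[] = [];
  for (const c of counters) {
    lines.push(`# HELP ${c.name} ${c.help}`, `# TYPE ${c.name} counter`, `${c.name} ${c.value}`, "");
  }
  lines.push(
    "# HELP gql_operation_count Requests per operation",
    "# TYPE gql_operation_count counter",
  );
  for (const op of ops) {
    const labels = `operation="${escapeLabel(op.operation)}",type="${escapeLabel(op.type)}"`;
    lines.push(`gql_operation_count{${labels}} ${op.count}`);
  }
  return lines.join("\n") + "\n";
}

const exposition = renderPrometheus(
  [
    { name: "gql_requests_total", help: "Total GraphQL requests", value: 1247 },
    { name: "gql_errors_total", help: "Total GraphQL errors", value: 23 },
  ],
  [{ operation: "GetProducts", type: "query", count: 892 }],
);
```

Escaping label values matters here because operation names are client-supplied and end up inside quoted labels.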

`POST /admin/metrics/reset`

Resets all counters to zero. Useful for testing or after deploying a new version:

curl -X POST http://localhost:3000/admin/metrics/reset
# { "message": "Metrics counters reset" }

Note: Metrics are stored in-memory and reset on gateway restart. This is intentional and follows the same pattern as the deprecation tracking module. Redis-backed persistence can be added later if historical data is needed.

`GET /admin/health`

Returns process health and memory stats, used by the soak test harness for memory leak detection:

{
  "status": "ok",
  "pid": 12345,
  "uptime": 3600.5,
  "memory": {
    "rss": 256901120,
    "heapTotal": 174063616,
    "heapUsed": 169410560,
    "external": 8294400,
    "arrayBuffers": 4651008
  }
}

This endpoint is polled at regular intervals during soak tests to build a memory timeline and detect heap growth patterns.
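
The fields in this payload correspond to values Node.js exposes through `process.memoryUsage()`, `process.pid`, and `process.uptime()`. A handler could assemble the response roughly as follows; whether the controller builds it exactly this way is an assumption, and `healthSnapshot` is a hypothetical name.

```typescript
interface HealthSnapshot {
  status: "ok";
  pid: number;
  uptime: number; // seconds since the process started
  memory: {
    rss: number;
    heapTotal: number;
    heapUsed: number;
    external: number;
    arrayBuffers: number;
  };
}

// Collects process-level stats; all memory values are in bytes.
function healthSnapshot(): HealthSnapshot {
  const { rss, heapTotal, heapUsed, external, arrayBuffers } = process.memoryUsage();
  return {
    status: "ok",
    pid: process.pid,
    uptime: process.uptime(),
    memory: { rss, heapTotal, heapUsed, external, arrayBuffers },
  };
}

const snapshot = healthSnapshot();
```

For leak detection, `heapUsed` is usually the most telling field, since `rss` also includes memory the runtime has reserved but is not actively using.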


Environment Variables

All tracing features are independently toggleable with sensible defaults:

| Variable | Default | Description |
| --- | --- | --- |
| TRACING_ENABLED | true | Master toggle for the tracing plugin |
| APQ_ENABLED | true | Toggle Redis-backed APQ cache |
| RESPONSE_CACHE_ENABLED | true | Toggle Redis-backed response caching |
| RESPONSE_CACHE_MAX_AGE | 0 | Default PUBLIC max-age in seconds (0 = explicit hints only) |
| METRICS_ENABLED | true | Toggle metrics collection and admin endpoints |

To disable response caching in a development environment:

RESPONSE_CACHE_ENABLED=false

To disable all tracing features at once:

TRACING_ENABLED=false
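
For reference, a typed reader for these variables might look like the sketch below. The variable names and defaults come from the table; the parsing rules (accepting `true` or `1`, falling back to the default on empty or invalid input) and the names `readTracingConfig` and `TracingConfig` are assumptions, not the contents of tracing.constants.ts.

```typescript
interface TracingConfig {
  tracingEnabled: boolean;
  apqEnabled: boolean;
  responseCacheEnabled: boolean;
  responseCacheMaxAge: number;
  metricsEnabled: boolean;
}

type Env = Record<string, string | undefined>;

// Parses a boolean flag; unset or empty values fall back to the documented default.
function flag(env: Env, key: string, fallback: boolean): boolean {
  const raw = env[key]?.trim().toLowerCase();
  if (raw === undefined || raw === "") return fallback;
  return raw === "true" || raw === "1"; // assumption: "1" is accepted as true
}

function readTracingConfig(env: Env): TracingConfig {
  const maxAge = Number(env["RESPONSE_CACHE_MAX_AGE"] ?? "0");
  return {
    tracingEnabled: flag(env, "TRACING_ENABLED", true),
    apqEnabled: flag(env, "APQ_ENABLED", true),
    responseCacheEnabled: flag(env, "RESPONSE_CACHE_ENABLED", true),
    // Invalid or negative values fall back to 0 (explicit hints only).
    responseCacheMaxAge: Number.isFinite(maxAge) && maxAge >= 0 ? maxAge : 0,
    metricsEnabled: flag(env, "METRICS_ENABLED", true),
  };
}

const devConfig = readTracingConfig({ RESPONSE_CACHE_ENABLED: "false" });
```

In a running service the argument would typically be `process.env`; passing the environment in explicitly keeps the function easy to unit test.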

Monitoring with Grafana

  1. Add the gateway as a scrape target in your Prometheus configuration:

    scrape_configs:
      - job_name: 'ngwenya-gateway'
        metrics_path: '/admin/metrics/prometheus'
        static_configs:
          - targets: ['localhost:3000']
    
  2. Read the Monitoring & Alerting Infrastructure guide to learn how to launch the pre-provisioned Grafana dashboard that visualizes these metrics.

Key panels to build:

| Panel | Metric | Purpose |
| --- | --- | --- |
| Request rate | gql_requests_total | Traffic volume |
| Error rate | gql_errors_total / gql_requests_total | Error percentage |
| Cache hit ratio | gql_cache_hits_total / (gql_cache_hits_total + gql_cache_misses_total) | Cache effectiveness |
| APQ hit ratio | gql_apq_hits_total / (gql_apq_hits_total + gql_apq_misses_total) | APQ adoption |
| Slowest operations | gql_operation_duration_ms_avg | Performance bottlenecks |
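
The ratio formulas in the table can also be checked by hand against the `global` block of the JSON metrics endpoint. The sketch below applies them to the sample values shown earlier; `panelValues` and `GlobalCounters` are hypothetical names, and returning 0 for an empty denominator is an assumption.

```typescript
// Subset of the "global" block returned by GET /admin/metrics.
interface GlobalCounters {
  totalRequests: number;
  totalErrors: number;
  totalCacheHits: number;
  totalCacheMisses: number;
  apqHits: number;
  apqMisses: number;
}

// Guards against division by zero right after a counter reset.
const ratio = (part: number, whole: number): number =>
  whole === 0 ? 0 : part / whole;

function panelValues(g: GlobalCounters) {
  return {
    errorRate: ratio(g.totalErrors, g.totalRequests),
    cacheHitRatio: ratio(g.totalCacheHits, g.totalCacheHits + g.totalCacheMisses),
    apqHitRatio: ratio(g.apqHits, g.apqHits + g.apqMisses),
  };
}

// Sample values taken from the JSON example in this guide.
const panels = panelValues({
  totalRequests: 1247,
  totalErrors: 23,
  totalCacheHits: 412,
  totalCacheMisses: 835,
  apqHits: 1100,
  apqMisses: 147,
});
```

With the sample counters this works out to an error rate of about 1.8%, a cache hit ratio of about 33%, and an APQ hit ratio of about 88%.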

Testing

Unit Tests

# Run all gateway unit tests (138+ tests, including tracing + ws-auth)
npm run test -- apps/ngwenya-gateway --forceExit

Covers: metrics aggregation, Prometheus export, health endpoint, tracing plugin lifecycle, Redis cache adapter, factory functions, controller endpoints.

E2E Tests

# Run gateway E2E tests (19 tests, including tracing admin endpoints)
npx jest --config apps/ngwenya-gateway/test/jest-e2e.json --forceExit

Covers: JSON metrics endpoint, Prometheus text endpoint, counter reset, operation accumulation.

Soak Test

The gateway includes a zero-dependency soak test harness for verifying memory stability under sustained load:

# Full 30-minute soak test
make soak-test

# Quick 5-minute validation
SOAK_DURATION_MINS=5 make soak-test

The soak test:

  1. Resets gateway metrics and captures a baseline memory snapshot via /admin/health
  2. Launches a Node.js load generator (soak-load.mjs) sending a realistic query mix at configurable RPS
  3. Samples heap usage every 60 seconds via /admin/health
  4. Produces a JSON report (logs/soak-report.json) and CSV memory timeline (logs/soak-memory.csv)
  5. Verdict: PASS if heap growth stays below the threshold (default 50%), FAIL otherwise

| Variable | Default | Description |
| --- | --- | --- |
| SOAK_DURATION_MINS | 30 | Test duration in minutes |
| SOAK_CONCURRENCY | 10 | Concurrent request workers |
| SOAK_RPS | 50 | Target requests per second |
| SOAK_HEAP_THRESHOLD | 50 | Max allowed heap growth % |
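
The verdict rule comes down to one percentage comparison. The sketch below shows the arithmetic under two assumptions: growth is measured from the baseline `heapUsed` to the final sample, and growth exactly equal to the threshold counts as FAIL. The function name `soakVerdict` is hypothetical, and the harness itself may compare against a peak or an averaged value instead.

```typescript
interface SoakVerdict {
  growthPct: number;
  verdict: "PASS" | "FAIL";
}

// Compares heap growth between the baseline and the final sample against the threshold.
function soakVerdict(
  baselineHeapBytes: number,
  finalHeapBytes: number,
  thresholdPct = 50, // mirrors the SOAK_HEAP_THRESHOLD default
): SoakVerdict {
  if (baselineHeapBytes <= 0) throw new Error("baseline heap must be positive");
  const growthPct = ((finalHeapBytes - baselineHeapBytes) / baselineHeapBytes) * 100;
  return { growthPct, verdict: growthPct < thresholdPct ? "PASS" : "FAIL" };
}

// 100 MB growing to 140 MB is 40% growth: within the default threshold.
const stable = soakVerdict(100_000_000, 140_000_000);

// 100 MB growing to 160 MB is 60% growth: over the default threshold.
const leaking = soakVerdict(100_000_000, 160_000_000);
```

A single end-to-end comparison is sensitive to garbage collection timing, which is one reason the harness also writes the full memory timeline for manual inspection.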