Gateway Tracing & Observability: Developer Guide
Overview
The Ngwenya API Gateway provides four layers of observability that work together to monitor performance, reduce backend load, and optimize network throughput.
| Layer | Purpose | Backend |
|---|---|---|
| Response Caching | Cache full GraphQL responses for identical queries | Redis-backed KeyValueCache |
| Automatic Persisted Queries (APQ) | Reduce request size: clients send a hash, not the full query text | Redis-backed hash store |
| Tracing Plugin | Per-operation timing, error tracking, and classification | In-memory aggregation |
| Metrics Export | Admin visibility into operation performance | REST endpoints (JSON + Prometheus) |
All four layers are implemented in the apps/ngwenya-gateway/src/tracing/ module and are independently toggleable via environment variables.
Architecture
The tracing module sits alongside the existing rate-limit/ and deprecation/ modules:
| File | Purpose |
|---|---|
| `tracing.module.ts` | NestJS module wiring all providers |
| `tracing.constants.ts` | Env var keys, defaults, Redis key prefixes |
| `tracing.plugin.ts` | Apollo Server lifecycle hooks for timing/errors |
| `cache.config.ts` | Redis cache adapter + factory functions for response cache and APQ |
| `metrics.service.ts` | In-memory aggregation + Prometheus text export |
| `metrics.controller.ts` | REST admin endpoints (metrics, health) |
Request Flow
graph TD
A["Incoming GraphQL Request"] --> B{"APQ Enabled?"}
B -->|Yes| C{"Hash in Redis?"}
B -->|No| F["Parse full query"]
C -->|Hit| D["Resolve query from hash"]
C -->|Miss| E["Store hash → query in Redis"]
D --> F
E --> F
F --> G{"Response Cache Enabled?"}
G -->|Yes| H{"Cached result exists?"}
G -->|No| K["Execute resolvers"]
H -->|Hit| I["Return cached response"]
H -->|Miss| K
K --> L["TracingPlugin records metrics"]
L --> M{"Has @cacheControl hints?"}
M -->|Yes| N["Store in response cache"]
M -->|No| O["Skip caching"]
N --> P["Return response"]
O --> P
I --> P
style I fill:#22c55e,color:#fff
style P fill:#22c55e,color:#fff
style D fill:#3b82f6,color:#fff
style N fill:#3b82f6,color:#fff
Response Caching
The gateway uses @graphql-yoga/plugin-response-cache with a Redis-backed cache to cache complete GraphQL responses.
How It Works
- A Visitor or Buyer sends a query (e.g., GetProducts).
- The plugin checks if an identical response is already cached.
- Cache hit: returns immediately without executing any resolvers.
- Cache miss: resolvers execute, and the response is cached for future requests (if cache-eligible).
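The hit/miss flow above can be sketched in a few lines. This is only an illustration, not the plugin's implementation: it assumes an in-memory `Map` in place of Redis, and `ResponseCacheSketch` and its `resolve` method are hypothetical names.

```typescript
// One cached response body plus its expiry time in epoch milliseconds.
interface CacheEntry {
  body: string;
  expiresAt: number;
}

// Hypothetical in-memory stand-in for the Redis-backed response cache.
class ResponseCacheSketch {
  private entries = new Map<string, CacheEntry>();
  hits = 0;
  misses = 0;

  // Returns the cached body on a hit; otherwise runs `execute` (the resolvers)
  // and stores the result only when the response is cache-eligible (ttl > 0).
  resolve(
    key: string,
    ttlSeconds: number,
    execute: () => string,
    nowMs: number = Date.now(),
  ): string {
    const entry = this.entries.get(key);
    if (entry !== undefined && entry.expiresAt > nowMs) {
      this.hits += 1;
      return entry.body;
    }
    this.misses += 1;
    const body = execute();
    if (ttlSeconds > 0) {
      this.entries.set(key, { body, expiresAt: nowMs + ttlSeconds * 1000 });
    }
    return body;
  }
}

// Usage: the second identical request is served without running resolvers.
let resolverRuns = 0;
const runResolvers = (): string => {
  resolverRuns += 1;
  return '{"data":{"products":[]}}';
};
const sketchCache = new ResponseCacheSketch();
const firstBody = sketchCache.resolve("rc:q1:v1", 300, runResolvers, 0);
const secondBody = sketchCache.resolve("rc:q1:v1", 300, runResolvers, 1_000);
```

Passing the clock in as `nowMs` keeps the sketch deterministic; a real adapter would rely on Redis key expiry instead.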
Cache Eligibility
Responses are only cached when all resolved fields have @cacheControl hints with maxAge > 0. The gateway's safe default is maxAge: 0, meaning nothing is cached unless explicitly opted in at the subgraph schema level.
# In a subgraph schema: opt a type into response caching
type Product @cacheControl(maxAge: 300) {
  id: ID!
  name: String!
  price: Money!
}
# Per-field override: short cache for dynamic data
type Malet @cacheControl(maxAge: 60) {
  id: ID!
  name: String!
  activeVisitors: Int @cacheControl(maxAge: 10)
}
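As a rough illustration of the eligibility rule, the sketch below computes a TTL for a whole response from its per-field hints. It assumes the common cache-control convention that a response is only as cacheable as its least cacheable field; the source does not confirm how the plugin combines hints, and `FieldHint` and `effectiveMaxAge` are hypothetical names.

```typescript
// Hypothetical shape for a resolved field's cache hint (not a plugin type).
interface FieldHint {
  path: string;
  maxAge?: number; // seconds; undefined means the field carries no hint
}

// Returns the TTL for the whole response, or 0 when it must not be cached.
// Assumption: the smallest maxAge wins, and any field at 0 disables caching.
function effectiveMaxAge(hints: FieldHint[], defaultMaxAge = 0): number {
  if (hints.length === 0) return 0;
  let ttl = Infinity;
  for (const hint of hints) {
    const age = hint.maxAge ?? defaultMaxAge;
    if (age <= 0) return 0;
    ttl = Math.min(ttl, age);
  }
  return ttl;
}

// A Malet query that selects activeVisitors is limited by the 10-second field.
const maletTtl = effectiveMaxAge([
  { path: "malet.id", maxAge: 60 },
  { path: "malet.activeVisitors", maxAge: 10 },
]);

// One field without a hint falls back to the default of 0, so nothing is cached.
const unhintedTtl = effectiveMaxAge([
  { path: "product.id", maxAge: 300 },
  { path: "order.total" },
]);
```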
Session-Aware Caching (PRIVATE Scope)
The plugin supports per-user caching for responses marked with scope: PRIVATE:
type Query {
  myUCart: UCart @cacheControl(maxAge: 30, scope: PRIVATE)
}
The gateway identifies sessions by extracting the authenticated user's ID from the GraphQL context. Anonymous Visitors share the PUBLIC cache; authenticated Buyers get their own PRIVATE cache partition.
Redis Key Format
rc:{query-hash}:{variables-hash} # PUBLIC scope
rc:{query-hash}:{variables-hash}:{userId} # PRIVATE scope
Multi-instance safe: Because the cache is Redis-backed, all gateway instances share the same cache. A cache entry written by instance A is immediately available to instance B.
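A key builder for the two formats might look like the sketch below. The function name and the choice to skip caching for a PRIVATE response with no authenticated user are assumptions made for illustration; the hashes are taken as inputs because the source does not specify how they are computed.

```typescript
type CacheScope = "PUBLIC" | "PRIVATE";

// Builds a response-cache key in the documented rc:{...} format.
// Returns null when the response should not be cached at all.
function responseCacheKey(
  queryHash: string,
  variablesHash: string,
  scope: CacheScope,
  userId?: string,
): string | null {
  const base = `rc:${queryHash}:${variablesHash}`;
  if (scope === "PUBLIC") return base; // shared by Visitors and Buyers alike
  // Assumption: PRIVATE data is never cached without a user to partition by.
  return userId ? `${base}:${userId}` : null;
}

const publicKey = responseCacheKey("q1", "v1", "PUBLIC", "user-42");
const privateKey = responseCacheKey("q1", "v1", "PRIVATE", "user-42");
const anonymousPrivateKey = responseCacheKey("q1", "v1", "PRIVATE");
```

Keeping the user ID as the last segment means a PUBLIC key is always a prefix of the matching PRIVATE keys, which makes pattern-based inspection in Redis straightforward.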
Automatic Persisted Queries (APQ)
APQ is a network optimization provided for GraphQL Yoga by the @graphql-yoga/plugin-apq plugin. Instead of sending the full query text on every request, clients send its SHA-256 hash (64 hexadecimal characters), whereas the full query string can run to thousands of characters.
How It Works
Step 1: Client sends hash only
  → Gateway checks Redis for hash
  → Miss: returns "PersistedQueryNotFound"
Step 2: Client sends hash + full query text
  → Gateway stores hash→query mapping in Redis
  → Executes query normally
Step 3+: Client sends hash only
  → Gateway finds query in Redis
  → Executes query, no full text needed
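The three-step handshake can be sketched on the server side as follows. This is an illustration under stated assumptions, not the plugin's code: a `Map` stands in for the Redis hash store, `resolveApq` and the `HashMismatch` error are hypothetical, and keys use the `apq:{sha256-hash}` format documented in this guide.

```typescript
import { createHash } from "node:crypto";

type ApqResult =
  | { ok: true; query: string }
  | { ok: false; error: "PersistedQueryNotFound" | "HashMismatch" };

// Hex-encoded SHA-256 digest, always 64 characters long.
const sha256Hex = (text: string): string =>
  createHash("sha256").update(text).digest("hex");

// In-memory stand-in for the Redis hash store.
const apqStore = new Map<string, string>();

function resolveApq(hash: string, query?: string): ApqResult {
  const key = `apq:${hash}`;
  if (query === undefined) {
    // Hash-only request: succeed only if the mapping is already stored.
    const cached = apqStore.get(key);
    return cached !== undefined
      ? { ok: true, query: cached }
      : { ok: false, error: "PersistedQueryNotFound" };
  }
  // Hash + query: verify before storing so a client cannot poison the store.
  if (sha256Hex(query) !== hash) return { ok: false, error: "HashMismatch" };
  apqStore.set(key, query);
  return { ok: true, query };
}

const productsQuery = "query GetProducts { products { id name } }";
const productsHash = sha256Hex(productsQuery);
const step1 = resolveApq(productsHash); // miss: PersistedQueryNotFound
const step2 = resolveApq(productsHash, productsQuery); // registers the mapping
const step3 = resolveApq(productsHash); // hit: query resolved from hash
```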
Why It Matters
| Benefit | Impact |
|---|---|
| Smaller requests | Hash (64 chars) vs full query (500-5000 chars) |
| HTTP GET eligible | Hashed queries fit in URLs, enabling CDN caching |
| Survives restarts | Redis-backed cache persists across gateway deploys |
| Zero client effort | Apollo Client enables APQ with one config line |
Frontend Configuration
If the frontend uses Apollo Client, enable APQ with the persisted query link:
import { createPersistedQueryLink } from '@apollo/client/link/persisted-queries';
import { sha256 } from 'crypto-hash';

const link = createPersistedQueryLink({
  sha256,
  useGETForHashedQueries: true, // enables CDN edge caching
});
Redis Key Format
apq:{sha256-hash}
Tracing Plugin
The TracingPlugin hooks into GraphQL Yoga's request lifecycle to capture per-operation diagnostics:
What It Tracks
| Metric | Description |
|---|---|
| Operation name | e.g., GetProducts, CreateMurchase, anonymous |
| Operation type | query, mutation, subscription |
| Wall-clock latency | Start-to-response time in milliseconds |
| Error count | Number of errors per operation |
| Min / Max / Avg latency | Aggregated over the operation's lifetime |
| P95 latency | 95th percentile from sampled latencies |
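A per-operation aggregate that yields these numbers could be sketched as below. The field names mirror the `/admin/metrics` JSON, but the class itself, the bounded sample window, and the nearest-rank percentile method are assumptions; the source only says P95 is computed "from sampled latencies".

```typescript
// Hypothetical aggregate for a single operation name.
class OperationStats {
  count = 0;
  errorCount = 0;
  totalDurationMs = 0;
  minDurationMs = Infinity;
  maxDurationMs = 0;
  private latencies: number[] = [];
  private readonly maxSamples: number;

  // Assumption: keep only the most recent samples so memory stays bounded.
  constructor(maxSamples = 1000) {
    this.maxSamples = maxSamples;
  }

  record(durationMs: number, errors = 0): void {
    this.count += 1;
    this.errorCount += errors;
    this.totalDurationMs += durationMs;
    this.minDurationMs = Math.min(this.minDurationMs, durationMs);
    this.maxDurationMs = Math.max(this.maxDurationMs, durationMs);
    this.latencies.push(durationMs);
    if (this.latencies.length > this.maxSamples) this.latencies.shift();
  }

  get avgDurationMs(): number {
    return this.count === 0 ? 0 : this.totalDurationMs / this.count;
  }

  // Nearest-rank percentile over the retained samples.
  percentile(p: number): number {
    if (this.latencies.length === 0) return 0;
    const sorted = [...this.latencies].sort((a, b) => a - b);
    const rank = Math.ceil((p * sorted.length) / 100);
    return sorted[Math.max(rank, 1) - 1];
  }
}

// Usage: record latencies of 1..20 ms, one of them with an error.
const getProductsStats = new OperationStats();
for (let ms = 1; ms <= 20; ms += 1) {
  getProductsStats.record(ms, ms === 20 ? 1 : 0);
}
const p95 = getProductsStats.percentile(95);
```

Sorting on every read is fine for an admin endpoint polled occasionally; a histogram would be the usual choice if percentiles were needed on the hot path.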
What It Skips
- Introspection queries (IntrospectionQuery): not meaningful for metrics.
- When disabled: returns empty lifecycle hooks with zero overhead.
Plugin Lifecycle
// The plugin hooks into two Apollo lifecycle events:
requestDidStart → {
  // 1. didResolveOperation: filters introspection queries
  // 2. willSendResponse: records timing + error data
}
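To make the lifecycle concrete, here is a minimal sketch of a plugin with that shape. The hook names come from the outline above, but `TracedRequest`, `RequestHooks`, `RecordFn`, and `createTracingPlugin` are hypothetical local types and names, not the server library's API, and the real plugin may differ.

```typescript
// Local stand-ins for the server's plugin types (hypothetical).
interface TracedRequest {
  operationName: string | null;
  errors: number;
}
interface RequestHooks {
  didResolveOperation?(info: TracedRequest): void;
  willSendResponse?(info: TracedRequest): void;
}
type RecordFn = (operation: string, durationMs: number, errors: number) => void;

function createTracingPlugin(
  enabled: boolean,
  record: RecordFn,
  now: () => number = Date.now,
) {
  return {
    requestDidStart(): RequestHooks {
      if (!enabled) return {}; // disabled: empty hooks, nothing to run
      const startedAt = now();
      let skip = false;
      return {
        // 1. Filter introspection queries out of the metrics.
        didResolveOperation(info) {
          skip = info.operationName === "IntrospectionQuery";
        },
        // 2. Record wall-clock latency and the error count.
        willSendResponse(info) {
          if (skip) return;
          record(info.operationName ?? "anonymous", now() - startedAt, info.errors);
        },
      };
    },
  };
}

// Usage with a fake clock so the recorded duration is deterministic.
const recorded: Array<[string, number, number]> = [];
let fakeClock = 0;
const tracingPlugin = createTracingPlugin(
  true,
  (operation, durationMs, errors) => {
    recorded.push([operation, durationMs, errors]);
  },
  () => fakeClock,
);
const requestHooks = tracingPlugin.requestDidStart();
requestHooks.didResolveOperation?.({ operationName: "GetProducts", errors: 0 });
fakeClock = 12;
requestHooks.willSendResponse?.({ operationName: "GetProducts", errors: 0 });
```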
Metrics Export
The MetricsService provides in-memory aggregation with four REST admin endpoints:
`GET /admin/metrics`
Returns operation-level metrics as JSON:
{
  "global": {
    "totalRequests": 1247,
    "totalErrors": 23,
    "totalCacheHits": 412,
    "totalCacheMisses": 835,
    "apqHits": 1100,
    "apqMisses": 147,
    "uptimeMs": 3600000
  },
  "operations": [
    {
      "operationName": "GetProducts",
      "operationType": "query",
      "count": 892,
      "totalDurationMs": 15764,
      "minDurationMs": 3,
      "maxDurationMs": 245,
      "errorCount": 2,
      "cacheHits": 380,
      "latencies": [...]
    }
  ],
  "operationCount": 15,
  "collectedAt": "2026-04-05T18:00:00.000Z"
}
`GET /admin/metrics/prometheus`
Returns metrics in Prometheus text exposition format, compatible with Prometheus, Grafana, and any OpenMetrics scraper:
# HELP gql_requests_total Total GraphQL requests
# TYPE gql_requests_total counter
gql_requests_total 1247
# HELP gql_errors_total Total GraphQL errors
# TYPE gql_errors_total counter
gql_errors_total 23
# HELP gql_cache_hits_total Response cache hits
# TYPE gql_cache_hits_total counter
gql_cache_hits_total 412
# HELP gql_apq_hits_total APQ cache hits
# TYPE gql_apq_hits_total counter
gql_apq_hits_total 1100
# HELP gql_operation_count Requests per operation
# TYPE gql_operation_count counter
gql_operation_count{operation="GetProducts",type="query"} 892
gql_operation_count{operation="CreateMurchase",type="mutation"} 156
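For readers unfamiliar with the text exposition format, the sketch below renders a subset of the output above from plain counters. The function names and the `OperationCount` shape are hypothetical; the metric names and help strings are copied from the sample, and the label escaping follows the format's rules for backslash, double quote, and newline.

```typescript
interface OperationCount {
  operation: string;
  type: string;
  count: number;
}

// Emits the HELP, TYPE, and sample lines for one unlabeled counter.
function renderCounter(metric: string, help: string, value: number): string[] {
  return [`# HELP ${metric} ${help}`, `# TYPE ${metric} counter`, `${metric} ${value}`];
}

// Label values must escape backslash, double quote, and newline.
function escapeLabel(value: string): string {
  return value.replace(/\\/g, "\\\\").replace(/"/g, '\\"').replace(/\n/g, "\\n");
}

function renderPrometheus(
  totalRequests: number,
  totalErrors: number,
  operations: OperationCount[],
): string {
  const lines = [
    ...renderCounter("gql_requests_total", "Total GraphQL requests", totalRequests),
    ...renderCounter("gql_errors_total", "Total GraphQL errors", totalErrors),
    "# HELP gql_operation_count Requests per operation",
    "# TYPE gql_operation_count counter",
    ...operations.map(
      (op) =>
        `gql_operation_count{operation="${escapeLabel(op.operation)}",type="${escapeLabel(op.type)}"} ${op.count}`,
    ),
  ];
  return lines.join("\n") + "\n"; // the format expects a trailing newline
}

const exposition = renderPrometheus(1247, 23, [
  { operation: "GetProducts", type: "query", count: 892 },
  { operation: "CreateMurchase", type: "mutation", count: 156 },
]);
```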
`POST /admin/metrics/reset`
Resets all counters to zero. Useful for testing or after deploying a new version:
curl -X POST http://localhost:30000/admin/metrics/reset
# { "message": "Metrics counters reset" }
Note: Metrics are stored in-memory and reset on gateway restart. This is intentional and follows the same pattern used by the deprecation tracking module. Redis-backed persistence can be added later if historical data is needed.
`GET /admin/health`
Returns process health and memory stats, used by the soak test harness for memory leak detection:
{
  "status": "ok",
  "pid": 12345,
  "uptime": 3600.5,
  "memory": {
    "rss": 256901120,
    "heapTotal": 174063616,
    "heapUsed": 169410560,
    "external": 8294400,
    "arrayBuffers": 4651008
  }
}
This endpoint is polled at regular intervals during soak tests to build a memory timeline and detect heap growth patterns.
Environment Variables
All tracing features are independently toggleable with sensible defaults:
| Variable | Default | Description |
|---|---|---|
| `TRACING_ENABLED` | `true` | Master toggle for the tracing plugin |
| `APQ_ENABLED` | `true` | Toggle Redis-backed APQ cache |
| `RESPONSE_CACHE_ENABLED` | `true` | Toggle Redis-backed response caching |
| `RESPONSE_CACHE_MAX_AGE` | `0` | Default PUBLIC max-age in seconds (0 = explicit hints only) |
| `METRICS_ENABLED` | `true` | Toggle metrics collection and admin endpoints |
To disable response caching in a development environment:
RESPONSE_CACHE_ENABLED=false
To disable all tracing features at once:
TRACING_ENABLED=false
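One way to read these toggles is sketched below. It is not the module's actual parsing code: `readBool`, `readTracingConfig`, and the accepted truthy spellings are assumptions, as is the rule that `TRACING_ENABLED=false` overrides the per-feature toggles (inferred from the "disable all tracing features at once" behavior described above).

```typescript
type Env = Record<string, string | undefined>;

interface TracingConfig {
  tracingEnabled: boolean;
  apqEnabled: boolean;
  responseCacheEnabled: boolean;
  responseCacheMaxAge: number;
  metricsEnabled: boolean;
}

// Unset or empty values fall back to the documented default.
function readBool(env: Env, key: string, fallback: boolean): boolean {
  const raw = env[key]?.trim().toLowerCase();
  if (raw === undefined || raw === "") return fallback;
  return raw === "true" || raw === "1"; // assumption: accepted truthy spellings
}

function readTracingConfig(env: Env): TracingConfig {
  const tracingEnabled = readBool(env, "TRACING_ENABLED", true);
  const maxAge = Number(env["RESPONSE_CACHE_MAX_AGE"] ?? "0");
  return {
    tracingEnabled,
    // Assumption: the master toggle wins over the individual feature toggles.
    apqEnabled: tracingEnabled && readBool(env, "APQ_ENABLED", true),
    responseCacheEnabled: tracingEnabled && readBool(env, "RESPONSE_CACHE_ENABLED", true),
    // Invalid or negative values fall back to 0 (explicit hints only).
    responseCacheMaxAge: Number.isFinite(maxAge) && maxAge > 0 ? Math.floor(maxAge) : 0,
    metricsEnabled: tracingEnabled && readBool(env, "METRICS_ENABLED", true),
  };
}

// In the gateway this would be called as readTracingConfig(process.env).
const devConfig = readTracingConfig({ RESPONSE_CACHE_ENABLED: "false" });
const allOffConfig = readTracingConfig({ TRACING_ENABLED: "false" });
```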
Monitoring with Grafana
Add the gateway as a scrape target in your Prometheus configuration:
scrape_configs:
  - job_name: 'ngwenya-gateway'
    metrics_path: '/admin/metrics/prometheus'
    static_configs:
      - targets: ['localhost:3000']
Read the Monitoring & Alerting Infrastructure guide to learn how to launch the pre-provisioned Grafana dashboard that visualizes these metrics.
Key dashboards to build:
| Panel | Metric | Purpose |
|---|---|---|
| Request rate | `gql_requests_total` | Traffic volume |
| Error rate | `gql_errors_total / gql_requests_total` | Error percentage |
| Cache hit ratio | `gql_cache_hits_total / (gql_cache_hits_total + gql_cache_misses_total)` | Cache effectiveness |
| APQ hit ratio | `gql_apq_hits_total / (gql_apq_hits_total + gql_apq_misses_total)` | APQ adoption |
| Slowest operations | `gql_operation_duration_ms_avg` | Performance bottlenecks |
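As a sanity check on the ratio panels, the same arithmetic can be applied to the sample counters from the `GET /admin/metrics` payload shown earlier. The helper below is a hypothetical illustration only; in Grafana these would be expressed as queries over the scraped counters rather than computed in application code.

```typescript
// Share of hits among all lookups; 0 when nothing has been counted yet.
function hitRatio(hits: number, misses: number): number {
  const total = hits + misses;
  return total === 0 ? 0 : hits / total;
}

// Values taken from the sample JSON payload in this guide.
const cacheHitRatio = hitRatio(412, 835); // about 33% of responses served from cache
const apqHitRatio = hitRatio(1100, 147); // about 88% of requests resolved by hash
const errorRate = 23 / 1247; // about 1.8% of requests returned errors
```

The zero-total guard matters right after a metrics reset, when both counters are 0 and a naive division would produce NaN.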
Testing
Unit Tests
# Run all gateway unit tests (138+ tests, including tracing + ws-auth)
npm run test -- apps/ngwenya-gateway --forceExit
Covers: metrics aggregation, Prometheus export, health endpoint, tracing plugin lifecycle, Redis cache adapter, factory functions, controller endpoints.
E2E Tests
# Run gateway E2E tests (19 tests, including tracing admin endpoints)
npx jest --config apps/ngwenya-gateway/test/jest-e2e.json --forceExit
Covers: JSON metrics endpoint, Prometheus text endpoint, counter reset, operation accumulation.
Soak Test
The gateway includes a zero-dependency soak test harness for verifying memory stability under sustained load:
# Full 30-minute soak test
make soak-test
# Quick 5-minute validation
SOAK_DURATION_MINS=5 make soak-test
The soak test:
- Resets gateway metrics and captures a baseline memory snapshot via /admin/health
- Launches a Node.js load generator (soak-load.mjs) sending a realistic query mix at configurable RPS
- Samples heap usage every 60 seconds via /admin/health
- Produces a JSON report (logs/soak-report.json) and CSV memory timeline (logs/soak-memory.csv)
- Verdict: PASS if heap growth stays below the threshold (default 50%), FAIL otherwise
| Variable | Default | Description |
|---|---|---|
| `SOAK_DURATION_MINS` | `30` | Test duration in minutes |
| `SOAK_CONCURRENCY` | `10` | Concurrent request workers |
| `SOAK_RPS` | `50` | Target requests per second |
| `SOAK_HEAP_THRESHOLD` | `50` | Max allowed heap growth % |
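The verdict logic can be sketched as follows. This assumes growth is measured from the baseline `heapUsed` to the final sample; the source does not say whether the harness uses the final or the peak value, and `HeapSample`, `heapGrowthPercent`, and `soakVerdict` are hypothetical names.

```typescript
// One row of the memory timeline, taken from /admin/health.
interface HeapSample {
  elapsedSec: number;
  heapUsed: number; // bytes
}

function heapGrowthPercent(baselineBytes: number, latestBytes: number): number {
  return ((latestBytes - baselineBytes) / baselineBytes) * 100;
}

// PASS only if growth is strictly below the threshold (SOAK_HEAP_THRESHOLD).
// Assumption: the first sample is the baseline and the last one is compared to it.
function soakVerdict(samples: HeapSample[], thresholdPercent = 50): "PASS" | "FAIL" {
  if (samples.length < 2) return "PASS"; // nothing to compare yet
  const baseline = samples[0].heapUsed;
  const latest = samples[samples.length - 1].heapUsed;
  return heapGrowthPercent(baseline, latest) < thresholdPercent ? "PASS" : "FAIL";
}

// Usage: heap settles at 40% above baseline, which stays under the 50% default.
const stableRun: HeapSample[] = [
  { elapsedSec: 0, heapUsed: 100_000_000 },
  { elapsedSec: 60, heapUsed: 135_000_000 },
  { elapsedSec: 120, heapUsed: 140_000_000 },
];
// A run that keeps climbing to 60% above baseline fails.
const leakyRun: HeapSample[] = [
  { elapsedSec: 0, heapUsed: 100_000_000 },
  { elapsedSec: 60, heapUsed: 130_000_000 },
  { elapsedSec: 120, heapUsed: 160_000_000 },
];
const stableVerdict = soakVerdict(stableRun);
const leakyVerdict = soakVerdict(leakyRun);
```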
Related
- Gateway (Hive Gateway): Federated gateway architecture, composition, benchmarks, and migration details
- Gateway Rate Limiting: Companion gateway feature with dual-window throttling, mutation penalties, and auth uplift
- Debugging & Testing: curl workflows for tracing auth and order pipelines
- Alerts Resilience & Delivery Tracking: Observability for the alerts pipeline (DLQ, retry, delivery logs)
- Error Tracking (GlitchTip): Client-side crash reporting, the frontend counterpart to gateway-level tracing