CI/CD Pipeline & Production Infrastructure
The Ngwenya federation platform uses GitHub Actions for continuous integration and deployment. This document covers the pipeline architecture, runner hosting strategy, database backup infrastructure, and CORS production lockdown.
Pipeline Architecture
The platform has two repositories, each with its own CI pipeline:
Backend: `ngwenya-federation`
PR Quality Gate (`ci.yaml`)
Runs on every pull request targeting main. Three parallel jobs:
| Job | Purpose | Est. Time |
|---|---|---|
| Lint & Types | pnpm lint + pnpm build (NestJS compile check) | ~3 min |
| Unit Tests | make test with MongoDB/Postgres/Redis/Meilisearch service containers | ~5 min |
| Security | vuln-regex-detector ReDoS scan | ~1 min |
The unit test job runs with full infrastructure services (MongoDB, Postgres, Redis, Meilisearch) via GitHub Actions service containers.
Merge Pipeline (`e2e.yaml`)
Runs on push to main (after PR merge). Executes the full E2E test suite with the same service containers:
make test-e2e # Runs all NestJS subgraph E2E tests with --forceExit
Note: Rust services (auth, uchat, intelligence, tower) are not yet compiled in CI, so the E2E suite currently covers the NestJS subgraphs only. Rust cross-compilation will be added in a future iteration.
Frontend: `ngwenya-front`
PR Quality Gate (`ci.yaml`)
Runs on every pull request targeting main. Five jobs (E2E depends on build):
| Job | Purpose | Est. Time |
|---|---|---|
| Lint | ESLint + Prettier format check | ~1 min |
| Type Check | svelte-kit sync + svelte-check | ~2 min |
| Build | vite build (validates the production bundle) | ~3 min |
| Unit Tests | Vitest unit test suite | ~1 min |
| E2E Tests | 35 Playwright tests against dev server (Chromium) | ~5 min |
The E2E job installs Playwright Chromium and runs against the Vite dev server on port 4173. Tests use the established mockAuth pattern for backend isolation (see E2E Testing Infrastructure).
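Because most of these jobs run in parallel, the wall-clock time of a PR run is shorter than the job-minutes it consumes. The TypeScript sketch below estimates both figures from the table above. The job ids are hypothetical, the durations are the table's estimates, and it assumes the E2E job's only dependency is the build job.

```typescript
// Illustrative model of the frontend PR pipeline; job ids are made up for this sketch.
interface Job {
  name: string;
  minutes: number; // estimated duration from the job table
  needs: string[]; // jobs that must finish before this one starts
}

const frontendJobs: Job[] = [
  { name: "lint", minutes: 1, needs: [] },
  { name: "type-check", minutes: 2, needs: [] },
  { name: "build", minutes: 3, needs: [] },
  { name: "unit-tests", minutes: 1, needs: [] },
  { name: "e2e", minutes: 5, needs: ["build"] }, // assumption: depends on build only
];

/** Sum of all job durations: the job-minutes consumed by one run. */
function jobMinutes(jobs: Job[]): number {
  return jobs.reduce((sum, job) => sum + job.minutes, 0);
}

/** Longest dependency chain: the wall-clock time when independent jobs run in parallel. */
function wallClockMinutes(jobs: Job[]): number {
  const byName = new Map(jobs.map((job) => [job.name, job] as const));
  const finish = new Map<string, number>();

  // Finish time of a job = latest finish time among its dependencies + its own duration.
  // Assumes the dependency graph has no cycles.
  const finishTime = (name: string): number => {
    const cached = finish.get(name);
    if (cached !== undefined) return cached;
    const job = byName.get(name);
    if (!job) throw new Error(`Unknown job: ${name}`);
    const start = Math.max(0, ...job.needs.map(finishTime));
    const end = start + job.minutes;
    finish.set(name, end);
    return end;
  };

  return Math.max(0, ...jobs.map((job) => finishTime(job.name)));
}
```

With the table's estimates, this model gives about 8 minutes of wall-clock time (build followed by E2E) and 12 job-minutes per PR run.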
Documentation Portals: `ngwenya-dev` & `ngwenya-support`
Both Astro-powered portals have a build-check CI (ci.yaml) that runs astro build on every PR. This catches broken markdown, invalid frontmatter, remark plugin errors, and missing imports before merge.
Concurrency Controls
- Backend PR: Cancels in-progress runs on the same PR (cancel-in-progress: true)
- Backend E2E: Serializes runs on main (cancel-in-progress: false) to prevent race conditions
- Frontend PR: Cancels in-progress runs on the same PR
- Doc Portals: Cancel in-progress runs on the same PR
Staging Environment
The platform includes a self-contained staging environment mirroring the full production topology across all 5 repositories: ngwenya-federation, ngwenya-front, ngwenya-dev, ngwenya-support, and mall-design.
Staging Topology
The staging environment is orchestrated via docker-compose-staging.yaml in the backend repository and includes 34 containers:
- 19 NestJS Services: Containerized using multi-stage production builds.
- 4 Rust Services: Containerized release binaries.
- Frontend: SvelteKit containerized via @sveltejs/adapter-node.
- Doc Portals: Both ngwenya-dev and ngwenya-support compiled to static HTML and served via nginx:alpine.
- Infrastructure: Postgres, Redis, MongoDB, Meilisearch, MinIO, and Matrix Conduit.
- Observability: GlitchTip error tracking, Prometheus time-series metrics, Grafana dashboards, and Tower health monitoring.
Mobile App Integration
The native mobile apps (ngwenya-front/mobile) are not containerized but include build-time configurations to target the staging gateway:
- iOS (Swift): Uses an #if STAGING conditional in NgwenyaViewModel.swift.
- Android (Kotlin): Uses a custom staging build flavor exposing BuildConfig.GATEWAY_URL.
Operational Commands
The staging lifecycle is managed via Makefile targets in ngwenya-federation:
- make staging-up: Boots the full 34-container stack (~3.5 GB RAM).
- make staging-up-lite: Boots the critical path only (~2.0 GB RAM).
- make staging-health: Runs scripts/staging-healthcheck.sh to poll all services.
- make staging-test: Runs the E2E suite against the staging gateway.
- make staging-seed: Populates staging databases with test data.
Runner Hosting Strategy
Decision: GitHub-Hosted Runners (`ubuntu-latest`)
After evaluating GitHub's 2026 pricing changes, the platform uses GitHub-hosted runners for all CI/CD workflows.
Cost Analysis (as of January 2026)
| Factor | GitHub-Hosted | Self-Hosted |
|---|---|---|
| Setup | Zero (instant) | Provision VM, install agent, maintain |
| Linux cost | $0.008/min (39% reduction) | $0.002/min platform charge + infra |
| Free tier | 2,000 min/month (Free), 3,000 (Pro) | Same minutes consumed |
| Maintenance | GitHub handles everything | Team handles OS, deps, patches |
| Scale | Auto-scales with demand | Manual capacity planning |
| Cold starts | ~15-30s VM spin-up per job | Persistent = faster starts |
Why GitHub-Hosted (For Now)
- Zero operational overhead: no servers to patch, monitor, or scale
- Free tier sufficient: at current commit frequency, monthly CI usage stays within the included 2,000-3,000 minutes
- Predictable environments: eliminates "works on my machine" issues in CI
- No specialized hardware needs: NestJS + Jest don't require GPU or ARM
When to Re-evaluate
Switch to self-hosted runners when any of these conditions are met:
- Monthly CI minutes consistently exceed the free tier for 2+ months
- CI jobs require specialized hardware (GPU, ARM Mac, high-memory)
- Build times need to drop below what cold-start VMs allow
- Security policy requires code to stay on private infrastructure
Action item: Review CI minute consumption after 3 months of pipeline data. The usage dashboard is at Settings → Billing → Actions in the GitHub repository.
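The first re-evaluation trigger (free tier exceeded for 2+ months) can be checked mechanically once usage data exists. The following is a minimal sketch, not an existing script: the function and type names are hypothetical, the per-minute rate is the GitHub-hosted Linux figure from the cost table, and the included minutes must be set to match the account's plan.

```typescript
// Hypothetical helper; none of these names come from the Ngwenya repositories.
interface MonthlyUsage {
  month: string; // e.g. "2026-02"
  minutesUsed: number; // Linux runner minutes consumed in that month
}

// GitHub-hosted Linux rate from the cost table, kept as an integer number of
// thousandths of a dollar ($0.008/min) to avoid floating-point drift.
const HOSTED_LINUX_RATE_MILLS_PER_MIN = 8;

/** Estimated overage cost in USD for one month (0 when usage fits the free tier). */
function overageCostUsd(minutesUsed: number, includedMinutes: number): number {
  const overageMinutes = Math.max(0, minutesUsed - includedMinutes);
  return (overageMinutes * HOSTED_LINUX_RATE_MILLS_PER_MIN) / 1000;
}

/** True when each of the most recent `streak` months exceeded the included minutes. */
function exceededFreeTier(
  history: MonthlyUsage[],
  includedMinutes: number,
  streak = 2,
): boolean {
  if (history.length < streak) return false;
  return history.slice(-streak).every((m) => m.minutesUsed > includedMinutes);
}
```

For example, a history of 1,800, 2,300, and 2,500 minutes against a 2,000-minute allowance trips the check, and the last month's 500-minute overage would cost about $4 at the listed rate.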
Database Backup Strategy
The platform provides automated backup/restore for all databases via Makefile targets.
Quick Start
make backup-db # Full backup (MongoDB + Postgres)
make backup-db-mongo # MongoDB only
make backup-db-postgres # Postgres only
make restore-db # Restore from latest backup
make restore-db BACKUP=2026-05-07T15-20 # Restore specific backup
Storage Backends
Backups support pluggable storage backends via environment variables:
| Backend | BACKUP_STORAGE | Required Config | Use Case |
|---|---|---|---|
| Local | local (default) | BACKUP_LOCAL_DIR | Development, quick snapshots |
| S3 | s3 | BACKUP_S3_BUCKET, AWS credentials | AWS production backups |
| R2 | r2 | BACKUP_S3_BUCKET, BACKUP_S3_ENDPOINT | Cloudflare R2 offsite backups |
Adding a new storage backend requires adding an upload_*() function in scripts/backup-databases.sh and a case in the upload switch.
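The backend selection rules in the table can be expressed as a small pre-flight check. This TypeScript sketch is illustrative only: the real logic lives in scripts/backup-databases.sh, the function names are hypothetical, and credential checks are left out because the exact credential variable names the script reads are not listed here. It also treats BACKUP_LOCAL_DIR as required for the local backend, as the table states.

```typescript
// Illustrative pre-flight check; not the project's actual validation.
const BACKENDS = ["local", "s3", "r2"] as const;
type BackupBackend = (typeof BACKENDS)[number];

type Env = Record<string, string | undefined>;

// Required variables per backend, mirroring the storage backend table.
// AWS credentials for "s3" are intentionally not checked in this sketch.
const REQUIRED_VARS: Record<BackupBackend, string[]> = {
  local: ["BACKUP_LOCAL_DIR"],
  s3: ["BACKUP_S3_BUCKET"],
  r2: ["BACKUP_S3_BUCKET", "BACKUP_S3_ENDPOINT"],
};

/** Resolves BACKUP_STORAGE, falling back to "local" when it is unset or empty. */
function resolveBackend(env: Env): BackupBackend {
  const raw = (env.BACKUP_STORAGE || "local").toLowerCase();
  const match = BACKENDS.find((backend) => backend === raw);
  if (!match) throw new Error(`Unsupported BACKUP_STORAGE: ${raw}`);
  return match;
}

/** Lists the required variables that are unset or empty for the selected backend. */
function missingBackupVars(env: Env): string[] {
  return REQUIRED_VARS[resolveBackend(env)].filter((name) => !env[name]);
}
```

Calling `missingBackupVars(process.env)` before a backup run would surface a missing `BACKUP_S3_ENDPOINT` for R2 before any dump is taken.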
Automated Schedule
Backups run automatically via GitHub Actions (backup.yaml):
| Setting | Value |
|---|---|
| Schedule | Daily at 3:00 AM UTC (0 3 * * *) |
| Manual trigger | workflow_dispatch (trigger on demand from the Actions tab) |
| Cloud storage | Configure via repository secrets: BACKUP_S3_BUCKET, BACKUP_AWS_ACCESS_KEY_ID, BACKUP_AWS_SECRET_ACCESS_KEY |
| Local fallback | If no cloud storage is configured, backups are uploaded as GitHub Actions artifacts (30-day retention) |
To change the schedule, edit the cron expression in .github/workflows/backup.yaml.
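The expression `0 3 * * *` fires once a day at 03:00 UTC. As an illustration of how that maps onto wall-clock time, the hypothetical helper below computes the next scheduled run for a daily UTC schedule. It is not part of backup.yaml, and GitHub's documentation notes that scheduled workflows can start late when the service is under heavy load, so the result is a lower bound.

```typescript
/**
 * Next occurrence of a daily UTC schedule strictly after `now`.
 * Defaults match the documented cron expression (03:00 UTC).
 */
function nextDailyRunUtc(now: Date, hourUtc = 3, minuteUtc = 0): Date {
  // Candidate: today's scheduled time in UTC.
  const next = new Date(
    Date.UTC(now.getUTCFullYear(), now.getUTCMonth(), now.getUTCDate(), hourUtc, minuteUtc),
  );
  // If today's slot has already passed (or is exactly now), move to tomorrow.
  if (next.getTime() <= now.getTime()) {
    next.setUTCDate(next.getUTCDate() + 1);
  }
  return next;
}
```

For instance, `nextDailyRunUtc(new Date("2026-05-07T15:20:00Z"))` yields 03:00 UTC on 8 May 2026.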
What Gets Backed Up
| Database | Engine | Data |
|---|---|---|
| ngwenya | MongoDB | All collections (malets, products, orders, blogs, etc.) |
| ngwenya_auth | Postgres | Users, sessions, OAuth tokens, passkeys |
| ngwenya_uchat | Postgres | E2EE messages, conversations, participants |
| ngwenya_scim | Postgres | SCIM provisioning tokens, IdP configs |
Retention
- Local: Auto-prunes backups older than BACKUP_RETENTION_DAYS (default: 7 days)
- Cloud: Managed by the S3/R2 bucket's lifecycle policy
See environment-variables.md for all backup-related env vars.
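The local retention rule can be sketched as a pure function over backup names. This is an assumption-laden illustration: it infers the name format from the restore example (`2026-05-07T15-20`), treats those timestamps as UTC, and uses hypothetical function names. The actual pruning in the shell script may work differently, for example by file modification time.

```typescript
// Name format inferred from the restore example: YYYY-MM-DDTHH-MM (assumed UTC).
const BACKUP_NAME_PATTERN = /^(\d{4})-(\d{2})-(\d{2})T(\d{2})-(\d{2})$/;

/** Parses a backup name into a UTC Date, or returns null if the name does not match. */
function parseBackupName(name: string): Date | null {
  const match = BACKUP_NAME_PATTERN.exec(name);
  if (!match) return null;
  const [year, month, day, hour, minute] = match.slice(1).map(Number);
  return new Date(Date.UTC(year, month - 1, day, hour, minute));
}

/**
 * Returns the backup names older than the retention window.
 * Names that do not parse are never selected, so unrelated files are left alone.
 */
function backupsToPrune(names: string[], now: Date, retentionDays = 7): string[] {
  const cutoff = now.getTime() - retentionDays * 24 * 60 * 60 * 1000;
  return names.filter((name) => {
    const takenAt = parseBackupName(name);
    return takenAt !== null && takenAt.getTime() < cutoff;
  });
}
```

With `now` set to 10 May 2026 00:00 UTC and the default 7-day window, a backup named `2026-05-01T03-00` is selected for pruning while `2026-05-07T15-20` is kept.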
CORS Production Lockdown
CORS origins are environment-aware, controlled by NODE_ENV:
Development Mode (`NODE_ENV != 'production'`)
All localhost origins are allowed alongside production origins:
- http://localhost:5173 (Vite dev server)
- http://localhost:4321 (Astro docs portals)
- http://localhost:3000 (alt dev port)
Production Mode (`NODE_ENV=production`)
Only Mallnline subdomain origins are allowed:
- https://mallnline.com - The Lobby + Malets
- https://uid.mallnline.com - uID (Universal Identity)
- https://uchat.mallnline.com - uChat
- https://umail.mallnline.com - uMail
- https://ucart.mallnline.com - uCart Universal
- https://deck.mallnline.com - The Deck (Malet Owner workspace)
- https://studio.mallnline.com - The Studio (Developer workspace)
- https://tower.mallnline.com - The Tower (Platform Admin workspace)
Adding Extra Origins
Use the CORS_EXTRA_ORIGINS env var to add staging or preview URLs without code changes:
CORS_EXTRA_ORIGINS=https://staging.mallnline.com,https://preview-123.vercel.app
CORS lockdown is applied in two locations:
- Gateway (apps/ngwenya-gateway/src/main.ts)
- Media service (apps/media/src/main.ts)
Internal subgraph-to-subgraph communication does not use CORS (gateway forwards headers internally via TCP/HTTP).
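The origin rules described in this section can be summarized in one function. The sketch below is a hedged reconstruction from the lists above, not the code in either main.ts: `resolveCorsOrigins` and the constant names are hypothetical, and the real implementation may differ in details such as trimming or de-duplication.

```typescript
// Origins copied from the production and development lists in this section.
const PRODUCTION_ORIGINS: string[] = [
  "https://mallnline.com",
  "https://uid.mallnline.com",
  "https://uchat.mallnline.com",
  "https://umail.mallnline.com",
  "https://ucart.mallnline.com",
  "https://deck.mallnline.com",
  "https://studio.mallnline.com",
  "https://tower.mallnline.com",
];

const DEVELOPMENT_ORIGINS: string[] = [
  "http://localhost:5173",
  "http://localhost:4321",
  "http://localhost:3000",
];

/** Builds the allowed-origin list from NODE_ENV and CORS_EXTRA_ORIGINS. */
function resolveCorsOrigins(env: Record<string, string | undefined>): string[] {
  // Comma-separated extras; surrounding whitespace and empty entries are dropped.
  const extras = (env.CORS_EXTRA_ORIGINS || "")
    .split(",")
    .map((origin) => origin.trim())
    .filter((origin) => origin.length > 0);

  // Production allows only the Mallnline origins; any other mode also allows localhost.
  const base =
    env.NODE_ENV === "production"
      ? PRODUCTION_ORIGINS
      : [...PRODUCTION_ORIGINS, ...DEVELOPMENT_ORIGINS];

  // De-duplicate while preserving order.
  const all = [...base, ...extras];
  return all.filter((origin, index) => all.indexOf(origin) === index);
}
```

In a NestJS bootstrap, a list like this could be passed as the `origin` option of `app.enableCors(...)`. Keeping the rule in one shared function would also help the gateway and the media service stay in sync.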
Related
- E2E Testing Infrastructure: Playwright testing architecture and patterns
- Subdomain Auth Architecture: Domain separation strategy and cookie model
- Gateway Tracing: Prometheus metrics, response cache, APQ
- Database Backup & Disaster Recovery: Pluggable backup/restore scripts, scheduling, and storage backends
- Staging Environment Architecture: Overview of the 34-container pre-production environment
- Environment Variables: Full env var reference