CI/CD Pipeline & Production Infrastructure

The Ngwenya federation platform uses GitHub Actions for continuous integration and deployment. This document covers the pipeline architecture, staging environment, runner hosting strategy, database backup infrastructure, and CORS production lockdown.

Pipeline Architecture

The backend and frontend repositories each have their own CI pipeline, and the two documentation portals each run a build check:

Backend — `ngwenya-federation`

PR Quality Gate (`ci.yaml`)

Runs on every pull request targeting main. Three parallel jobs:

Job	Purpose	Est. Time
Lint & Types	`pnpm lint` + `pnpm build` (NestJS compile check)	~3 min
Unit Tests	`make test` with MongoDB/Postgres/Redis/Meilisearch service containers	~5 min
Security	vuln-regex-detector ReDoS scan	~1 min

The unit test job runs with full infrastructure services (MongoDB, Postgres, Redis, Meilisearch) via GitHub Actions service containers.

Merge Pipeline (`e2e.yaml`)

Runs on push to main (after PR merge). Executes the full E2E test suite with the same service containers:

make test-e2e  # Runs all NestJS subgraph E2E tests with --forceExit

Note: Rust services (auth, uchat, intelligence, tower) are not yet compiled in CI, so the E2E tests cover the NestJS subgraphs only. Rust cross-compilation will be added in a future iteration.

Frontend — `ngwenya-front`

PR Quality Gate (`ci.yaml`)

Runs on every pull request targeting main. Five jobs (E2E depends on build):

Job	Purpose	Est. Time
Lint	ESLint + Prettier format check	~1 min
Type Check	`svelte-kit sync` + `svelte-check`	~2 min
Build	`vite build` — validates production bundle	~3 min
Unit Tests	Vitest unit test suite	~1 min
E2E Tests	35 Playwright tests against dev server (Chromium)	~5 min

The E2E job installs Playwright Chromium and runs against the Vite server on port 4173 (Vite's default preview port). Tests use the established mockAuth pattern for backend isolation (see E2E Testing Infrastructure).

Documentation Portals — `ngwenya-dev` & `ngwenya-support`

Both Astro-powered portals have a build-check CI (ci.yaml) that runs astro build on every PR. This catches broken markdown, invalid frontmatter, remark plugin errors, and missing imports before merge.

Concurrency Controls

Backend PR: Cancels in-progress runs on the same PR (cancel-in-progress: true)
Backend E2E: Serializes runs on main (cancel-in-progress: false) so an in-flight run finishes before the next one starts, preventing race conditions between overlapping runs
Frontend PR: Cancels in-progress runs on the same PR
Doc Portals: Cancel in-progress runs on the same PR

Staging Environment

The platform includes a self-contained staging environment mirroring the full production topology across all 5 repositories: ngwenya-federation, ngwenya-front, ngwenya-dev, ngwenya-support, and mall-design.

Staging Topology

The staging environment is orchestrated via docker-compose-staging.yaml in the backend repository and includes 34 containers:

19 NestJS Services: Containerized using multi-stage production builds.
4 Rust Services: Containerized release binaries.
Frontend: SvelteKit containerized via @sveltejs/adapter-node.
Doc Portals: Both ngwenya-dev and ngwenya-support compiled to static HTML and served via nginx:alpine.
Infrastructure: Postgres, Redis, MongoDB, Meilisearch, MinIO, and Matrix Conduit.
Observability: GlitchTip error tracking, Prometheus time-series metrics, Grafana dashboards, and Tower health monitoring.

Mobile App Integration

The native mobile apps (ngwenya-front/mobile) are not containerized but include build-time configurations to target the staging gateway:

iOS (Swift): Uses an #if STAGING conditional in NgwenyaViewModel.swift.
Android (Kotlin): Uses a custom staging build flavor exposing BuildConfig.GATEWAY_URL.

Operational Commands

The staging lifecycle is managed via Makefile targets in ngwenya-federation:

make staging-up: Boots the full 34-container stack (~3.5 GB RAM).
make staging-up-lite: Boots the critical path only (~2.0 GB RAM).
make staging-health: Runs scripts/staging-healthcheck.sh to poll all services.
make staging-test: Runs the E2E suite against the staging gateway.
make staging-seed: Populates staging databases with test data.

Runner Hosting Strategy

Decision: GitHub-Hosted Runners (`ubuntu-latest`)

After evaluating GitHub's 2026 pricing changes, the platform uses GitHub-hosted runners for all CI/CD workflows.

Cost Analysis (as of January 2026)

Factor	GitHub-Hosted	Self-Hosted
Setup	Zero — instant	Provision VM, install agent, maintain
Linux cost	$0.008/min (39% reduction)	$0.002/min platform charge + infra
Free tier	2,000 min/month (Free), 3,000 (Pro)	Same minutes consumed
Maintenance	GitHub handles everything	Team handles OS, deps, patches
Scale	Auto-scales with demand	Manual capacity planning
Cold starts	~15-30s VM spin-up per job	Persistent = faster starts

Why GitHub-Hosted (For Now)

Zero operational overhead — no servers to patch, monitor, or scale
Free tier sufficient — at current commit frequency, monthly CI usage stays within the included 2,000-3,000 minutes
Predictable environments — eliminates "works on my machine" issues in CI
No specialized hardware needs — NestJS + Jest don't require GPU or ARM

When to Re-evaluate

Switch to self-hosted runners when any of these conditions are met:

Monthly CI minutes consistently exceed the free tier for 2+ months
CI jobs require specialized hardware (GPU, ARM Mac, high-memory)
Build times need to drop below what cold-start VMs allow
Security policy requires code to stay on private infrastructure

Action item: Review CI minute consumption after 3 months of pipeline data. The usage dashboard is at Settings → Billing → Actions for the GitHub account or organization that owns the repositories.
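
To make the free-tier check concrete, the sketch below estimates monthly minutes from the job durations in the PR tables above and compares them against the 2,000-minute Free allowance and the $0.008/min Linux rate from the cost table. The function name and the run counts are hypothetical, and the estimate leaves out the merge E2E pipeline, the doc portal builds, and the backup workflow because this document gives no durations for them.

```typescript
// Figures from the cost table above; run counts passed in are illustrative assumptions.
const FREE_TIER_MINUTES = 2000; // Free plan allowance per month
const LINUX_RATE_USD = 0.008; // GitHub-hosted Linux rate per minute

// Estimated job durations in minutes, copied from the PR quality gate tables.
const BACKEND_PR_JOBS = [3, 5, 1];
const FRONTEND_PR_JOBS = [1, 2, 3, 1, 5];

interface MonthlyEstimate {
  minutes: number;
  overageMinutes: number;
  overageUsd: number;
}

// Jobs are billed individually, so the minutes of parallel jobs add up.
function sum(values: number[]): number {
  return values.reduce((total, v) => total + v, 0);
}

// Hypothetical helper: estimates billed minutes for a month of PR pipeline runs.
function estimateMonthlyUsage(backendPrRuns: number, frontendPrRuns: number): MonthlyEstimate {
  const minutes =
    backendPrRuns * sum(BACKEND_PR_JOBS) + frontendPrRuns * sum(FRONTEND_PR_JOBS);
  const overageMinutes = Math.max(0, minutes - FREE_TIER_MINUTES);
  return { minutes, overageMinutes, overageUsd: overageMinutes * LINUX_RATE_USD };
}
```

With 60 PR runs per repository the estimate is 1,260 minutes, which fits the free tier. At 120 runs each it reaches 2,520 minutes, an overage of 520 minutes or about $4.16 at the listed rate.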

Database Backup Strategy

The platform provides automated backup/restore for all databases via Makefile targets.

Quick Start

make backup-db           # Full backup (MongoDB + Postgres)
make backup-db-mongo     # MongoDB only
make backup-db-postgres  # Postgres only
make restore-db          # Restore from latest backup
make restore-db BACKUP=2026-05-07T15-20  # Restore specific backup

Storage Backends

Backups support pluggable storage backends via environment variables:

Backend	`BACKUP_STORAGE`	Required Config	Use Case
Local	`local` (default)	`BACKUP_LOCAL_DIR`	Development, quick snapshots
S3	`s3`	`BACKUP_S3_BUCKET`, AWS credentials	AWS production backups
R2	`r2`	`BACKUP_S3_BUCKET`, `BACKUP_S3_ENDPOINT`	Cloudflare R2 offsite backups

Adding a new storage backend requires adding an upload_*() function in scripts/backup-databases.sh and a case in the upload switch.
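
The table above can also be read as a set of per-backend requirements. The following is a minimal TypeScript sketch of that mapping, not the logic of scripts/backup-databases.sh (which is a shell script). The function names are hypothetical, and the AWS credential variables are left out because their exact names in the script are not stated here.

```typescript
type BackupStorage = "local" | "s3" | "r2";

// Required variables per backend, as named in the storage backend table.
// AWS credentials are omitted: their variable names are not given in that table.
const REQUIRED_VARS: Record<BackupStorage, string[]> = {
  local: ["BACKUP_LOCAL_DIR"],
  s3: ["BACKUP_S3_BUCKET"],
  r2: ["BACKUP_S3_BUCKET", "BACKUP_S3_ENDPOINT"],
};

function isBackupStorage(value: string): value is BackupStorage {
  return value === "local" || value === "s3" || value === "r2";
}

// Hypothetical helper: returns the required variables that are missing or empty.
function missingBackupVars(env: Record<string, string | undefined>): string[] {
  const storage = env.BACKUP_STORAGE || "local"; // `local` is the documented default
  if (!isBackupStorage(storage)) {
    throw new Error(`Unknown BACKUP_STORAGE value: ${storage}`);
  }
  return REQUIRED_VARS[storage].filter((name) => !env[name]);
}
```

A check like this could run before a backup starts, so that a misconfigured R2 target fails early instead of after the dumps have been taken.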

Automated Schedule

Backups run automatically via GitHub Actions (backup.yaml):

Setting	Value
Schedule	Daily at 3:00 AM UTC (`0 3 * * *`)
Manual trigger	`workflow_dispatch` — trigger on-demand from the Actions tab
Cloud storage	Configure via repository secrets: `BACKUP_S3_BUCKET`, `BACKUP_AWS_ACCESS_KEY_ID`, `BACKUP_AWS_SECRET_ACCESS_KEY`
Local fallback	If no cloud storage is configured, backups are uploaded as GitHub Actions artifacts (30-day retention)

To change the schedule, edit the cron expression in .github/workflows/backup.yaml.

What Gets Backed Up

Database	Engine	Data
`ngwenya`	MongoDB	All collections (malets, products, orders, blogs, etc.)
`ngwenya_auth`	Postgres	Users, sessions, OAuth tokens, passkeys
`ngwenya_uchat`	Postgres	E2EE messages, conversations, participants
`ngwenya_scim`	Postgres	SCIM provisioning tokens, IdP configs

Retention

Local: Auto-prunes backups older than BACKUP_RETENTION_DAYS (default: 7 days)
Cloud: Managed by the S3/R2 bucket's lifecycle policy
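
As an illustration of the local retention rule, the sketch below selects backups older than a retention window and picks the most recent one. It assumes backup names follow the `2026-05-07T15-20` pattern shown in the restore example and that those timestamps are UTC. The function names are hypothetical and this is not the pruning code of the backup script itself.

```typescript
// Parses a name such as "2026-05-07T15-20" (pattern taken from the restore example).
// Treating the timestamp as UTC is an assumption.
function parseBackupName(name: string): Date | null {
  const m = /^(\d{4})-(\d{2})-(\d{2})T(\d{2})-(\d{2})$/.exec(name);
  if (!m) return null;
  const [, y, mo, d, h, mi] = m.map(Number);
  return new Date(Date.UTC(y, mo - 1, d, h, mi));
}

// Returns the backups older than `retentionDays` relative to `now`.
function backupsToPrune(names: string[], retentionDays: number, now: Date): string[] {
  const cutoff = now.getTime() - retentionDays * 24 * 60 * 60 * 1000;
  return names.filter((name) => {
    const taken = parseBackupName(name);
    // Names that do not match the pattern are left alone rather than deleted.
    return taken !== null && taken.getTime() < cutoff;
  });
}

// The fixed-width name format sorts chronologically as plain strings.
function latestBackup(names: string[]): string | undefined {
  const valid = names.filter((n) => parseBackupName(n) !== null).sort();
  return valid.length > 0 ? valid[valid.length - 1] : undefined;
}
```

Skipping names that do not parse is a deliberately conservative choice: an unexpected file in the backup directory is kept rather than removed.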

See environment-variables.md for all backup-related env vars.

CORS Production Lockdown

CORS origins are environment-aware, controlled by NODE_ENV:

Development Mode (`NODE_ENV != 'production'`)

The following localhost origins are allowed alongside the production origins:

http://localhost:5173 (Vite dev server)
http://localhost:4321 (Astro docs portals)
http://localhost:3000 (alt dev port)

Production Mode (`NODE_ENV=production`)

Only Mallnline origins (the apex domain and its subdomains) are allowed:

https://mallnline.com — The Lobby + Malets
https://uid.mallnline.com — uID (Universal Identity)
https://uchat.mallnline.com — uChat
https://umail.mallnline.com — uMail
https://ucart.mallnline.com — uCart Universal
https://deck.mallnline.com — The Deck (Malet Owner workspace)
https://studio.mallnline.com — The Studio (Developer workspace)
https://tower.mallnline.com — The Tower (Platform Admin workspace)

Adding Extra Origins

Use the CORS_EXTRA_ORIGINS env var to add staging or preview URLs without code changes:

CORS_EXTRA_ORIGINS=https://staging.mallnline.com,https://preview-123.vercel.app

CORS lockdown is applied in two locations:

Gateway (apps/ngwenya-gateway/src/main.ts)
Media service (apps/media/src/main.ts)

Internal subgraph-to-subgraph communication is not subject to CORS, which browsers enforce only on cross-origin requests; the gateway forwards headers internally via TCP/HTTP.
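
The origin rules above can be sketched as a single function. This is a minimal illustration, not the code in the gateway or media service. The function name is hypothetical, and applying CORS_EXTRA_ORIGINS in both development and production is an assumption.

```typescript
// Production origins, copied from the list above.
const PRODUCTION_ORIGINS = [
  "https://mallnline.com",
  "https://uid.mallnline.com",
  "https://uchat.mallnline.com",
  "https://umail.mallnline.com",
  "https://ucart.mallnline.com",
  "https://deck.mallnline.com",
  "https://studio.mallnline.com",
  "https://tower.mallnline.com",
];

// Localhost origins allowed outside production, copied from the list above.
const DEV_ORIGINS = [
  "http://localhost:5173",
  "http://localhost:4321",
  "http://localhost:3000",
];

// Hypothetical helper: resolves the allow-list from NODE_ENV and CORS_EXTRA_ORIGINS.
function resolveAllowedOrigins(env: Record<string, string | undefined>): string[] {
  // Comma-separated list; whitespace and empty entries are ignored.
  const extra = (env.CORS_EXTRA_ORIGINS || "")
    .split(",")
    .map((origin) => origin.trim())
    .filter((origin) => origin.length > 0);
  const base =
    env.NODE_ENV === "production"
      ? PRODUCTION_ORIGINS
      : PRODUCTION_ORIGINS.concat(DEV_ORIGINS);
  // Drop duplicates while keeping the first occurrence of each origin.
  return base.concat(extra).filter((origin, i, all) => all.indexOf(origin) === i);
}
```

In a NestJS bootstrap file, a list like this could be passed as `app.enableCors({ origin: resolveAllowedOrigins(process.env) })`. Taking the environment as a parameter keeps the function easy to test without touching the real process environment.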

Related

E2E Testing Infrastructure: Playwright testing architecture and patterns
Subdomain Auth Architecture: Domain separation strategy and cookie model
Gateway Tracing: Prometheus metrics, response cache, APQ
Database Backup & Disaster Recovery: Pluggable backup/restore scripts, scheduling, and storage backends
Staging Environment Architecture: Overview of the 34-container pre-production environment
Environment Variables: Full env var reference
