
SPECTRE Roadmap

Project: SPECTRE Fleet - Enterprise-Grade AI Agent Framework
Current Phase: Phase 4 In Progress (#46 Multi-Region Strategy)
Last Updated: 2026-03-08


✅ Phase 1: Core Infrastructure (Complete)

Timeline: Q4 2025
Status: ✅ Done

  • Event-driven architecture with NATS JetStream
  • 5-crate workspace (core, events, proxy, secrets, observability)
  • Basic proxy with JWT authentication
  • Secret management foundations
  • Development environment with Nix flakes

✅ Phase 2: Production Readiness (Complete)

Timeline: Q1 2026 (Jan-Feb)
Status: ✅ Done (22/22 core tasks)

Security

  • Argon2id KDF (replaced weak XOR)
  • RBAC (admin > service > readonly)
  • Rate limiting (token bucket)
  • Circuit breaker pattern
  • SBOM generation (CycloneDX)

Reliability

  • Retry logic with exponential backoff
  • Graceful shutdown (SIGTERM/SIGINT)
  • Health endpoints (/health, /ready, /metrics)
  • NATS auto-reconnection

Observability

  • Custom Prometheus metrics (3 metrics)
  • OTLP tracing to Tempo/Jaeger
  • Structured JSON logging
  • Request instrumentation

Infrastructure

  • Nix-first Kubernetes orchestration
  • Helm chart (17 files, 813 lines)
  • CI/CD pipeline (11 jobs)
  • Docker optimization (<50MB target)
  • Load testing script
  • Comprehensive documentation

Documentation

  • Architecture Decision Records (11 ADRs)
  • KUBERNETES.md deployment guide
  • Helm chart documentation
  • Phase 2 completion report

Deliverables: 16 commits, 4,200+ lines of production code


✅ Phase 3: Validation & Testing (Complete)

Timeline: Q1 2026 (Feb-Mar)
Focus: Integration testing, deployment validation, load testing

High Priority

#37: Nix-native NATS Module

Status: ✅ Done
Tasks:

  • Create nix/services/nats/conf.nix (nats.conf generator)
  • Create nix/services/nats/default.nix (mkConfig, mkServerPackage, environments)
  • Integrate into flake.nix (packages, apps, devShell)
  • Verify build: nix build .#nats-server-dev
  • ADR: NATS over Kafka decision registered

#38: NATS Integration Tests

Status: ✅ Done
Dependencies: Running NATS server (nix run .#nats)
Tasks:

  • Setup: nix run .#nats (replaces docker-compose)
  • Run: cargo test --test test_event_bus (10/10 passing)
  • Validate: Event publish/subscribe patterns
  • Validate: Request-reply with timeout
  • Fix: is_connected() race condition (flush on connect)
  • Document: NATS failure scenarios (crates/spectre-events/NATS_FAILURE_SCENARIOS.md)

#40: Local K8s Deployment

Status: ✅ Done
Dependencies: kind
Tasks:

  • Setup cluster: kind create cluster --name spectre-dev
  • Build + load image: nix build .#spectre-proxy-image + kind load
  • Deploy manifests: kubectl apply -f (Deployment, Service, ConfigMap, Ingress)
  • Test /health endpoint → 200 OK
  • Test /metrics endpoint → Prometheus metrics (3 metrics exposed)
  • Fix: Image tag mismatch (nix-dev vs dev), imagePullPolicy: Never
  • Fix: JWT_SECRET required in K8s Secret
  • Deploy NATS in-cluster for /ready probe (nix/kubernetes/nats.nix)
  • Fix: Image tag alignment (nix-dev), configmap NATS_URL → in-cluster DNS
  • Fix: Deploy script kind load support (flake.nix)
  • Validate: Ingress routing with nginx-ingress controller (in-cluster verified)

#42: Production Load Test

Status: ✅ Done
Dependencies: Full stack (NATS + proxy + neutron)
Tasks:

  • Create load test script: ./scripts/load-test.sh (6 phases, per-phase execution)
  • Run: full stack load test (NATS + proxy + neutron, 2026-02-15)
  • Validate: Circuit breaker triggers (neutron killed → 503 circuit open → 30s → recovery → 200)
  • Validate: Rate limiting under burst (300 req burst, burst=200 → 204 passed, 96 rejected)
  • Profile: CPU/memory post-load
  • Document: Performance baseline
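The burst validation above (300 requests against burst=200) can be illustrated with a minimal token bucket. This is a sketch under stated assumptions, not the proxy's actual limiter: the names `TokenBucket` and `admitted_in_burst` are hypothetical, the refill rate of 100 tokens/s is borrowed from the 100 RPS limit quoted in the security audit, and time is passed in explicitly so the behaviour is deterministic.

```rust
use std::time::{Duration, Instant};

/// Minimal token bucket (hypothetical sketch, not the spectre-proxy code).
/// `capacity` is the burst size; `refill_per_sec` is the sustained rate.
pub struct TokenBucket {
    capacity: f64,
    tokens: f64,
    refill_per_sec: f64,
    last_refill: Instant,
}

impl TokenBucket {
    /// Starts full, so an idle client can use the whole burst at once.
    pub fn new(capacity: u32, refill_per_sec: f64, now: Instant) -> Self {
        Self {
            capacity: capacity as f64,
            tokens: capacity as f64,
            refill_per_sec,
            last_refill: now,
        }
    }

    /// Returns true if a request arriving at `now` is admitted.
    pub fn try_acquire(&mut self, now: Instant) -> bool {
        // Add the tokens earned since the last call, capped at the burst size.
        let elapsed = now.saturating_duration_since(self.last_refill).as_secs_f64();
        self.tokens = (self.tokens + elapsed * self.refill_per_sec).min(self.capacity);
        self.last_refill = now;
        if self.tokens >= 1.0 {
            self.tokens -= 1.0;
            true
        } else {
            false
        }
    }
}

/// Sends `n` requests spaced `gap` apart and counts how many are admitted.
pub fn admitted_in_burst(bucket: &mut TokenBucket, start: Instant, n: u32, gap: Duration) -> u32 {
    (0..n).filter(|&i| bucket.try_acquire(start + gap * i)).count() as u32
}
```

With a zero gap, exactly the burst size is admitted. A real burst takes non-zero time to arrive, so a few extra tokens can be refilled while it is in flight; the 204 admissions measured above would be consistent with that, though the source does not state the cause.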

Performance Baseline (Debug Build, 2026-02-15, localhost):

| Metric | Value |
| --- | --- |
| /health RPS | 27,693 |
| /health p50 / p95 / p99 | 1.6ms / 3.4ms / 5.0ms |
| /ingest (auth+rate limit) RPS | 14,713 |
| /ingest p50 / p95 / p99 | 1.8ms / 4.0ms / 5.9ms |
| Proxy → Neutron p50 / p95 / p99 | 0.8ms / 1.4ms / 2.6ms |
| Rate limiter accuracy (burst=200) | 204 passed / 96 rejected (300 burst) |
| Circuit breaker: open → recovery | 503 while open → 200 after 30s timeout |
| VmRSS (post-load) | 23.4 MB |
| Thread count | 3 (tokio runtime) |

Performance Baseline (Release Build, 2026-02-16, localhost, 50 connections):

| Metric | Value |
| --- | --- |
| /health RPS | 58,130 |
| /health p50 / p95 / p99 | 0.5ms / 2.8ms / 4.6ms |
| /metrics (auth) RPS | 59,692 |
| /metrics p50 / p95 / p99 | 0.4ms / 2.6ms / 5.2ms |
| /ingest (auth+NATS) RPS | 68,903 |
| /ingest p50 / p95 / p99 | 0.4ms / 2.1ms / 4.0ms |
| /health (200 conns) RPS | 100,733 |
| /health (200 conns) p50 / p95 / p99 | 1.0ms / 6.3ms / 18.1ms |
| VmRSS (post-load) | 25.8 MB |
| Thread count | 13 (tokio runtime) |

Notes:

  • Release build throughput is roughly 2x (/health) to 4.7x (/ingest) that of the debug build
  • Rate limiter correctly enforces per-IP with configurable burst
  • Circuit breaker full lifecycle validated: closed → open (503) → half-open → closed (200)
  • 100K+ RPS at high concurrency (200 connections) with a 1.0ms p50
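The circuit breaker lifecycle noted above (closed → open → half-open → closed) can be sketched as a small state machine. This is an illustration under assumptions, not the proxy's implementation: the type and method names are hypothetical and the failure threshold is a free parameter, since the source states only the 30s open timeout.

```rust
use std::time::{Duration, Instant};

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum State {
    Closed,
    Open,
    HalfOpen,
}

/// Hypothetical circuit breaker sketch; the threshold is an assumed parameter.
pub struct CircuitBreaker {
    state: State,
    failures: u32,
    failure_threshold: u32,
    open_timeout: Duration,
    opened_at: Option<Instant>,
}

impl CircuitBreaker {
    pub fn new(failure_threshold: u32, open_timeout: Duration) -> Self {
        Self {
            state: State::Closed,
            failures: 0,
            failure_threshold,
            open_timeout,
            opened_at: None,
        }
    }

    /// Whether a call may go upstream at `now`. An open breaker moves to
    /// half-open once `open_timeout` has elapsed; callers would answer 503
    /// whenever this returns false.
    pub fn allow(&mut self, now: Instant) -> bool {
        if self.state == State::Open {
            let expired = self
                .opened_at
                .map_or(false, |t| now.saturating_duration_since(t) >= self.open_timeout);
            if expired {
                self.state = State::HalfOpen;
            }
        }
        self.state != State::Open
    }

    /// Any success closes the breaker and resets the failure count.
    pub fn record_success(&mut self) {
        self.failures = 0;
        self.state = State::Closed;
        self.opened_at = None;
    }

    /// A failed probe in half-open, or reaching the threshold, opens the breaker.
    pub fn record_failure(&mut self, now: Instant) {
        self.failures += 1;
        if self.state == State::HalfOpen || self.failures >= self.failure_threshold {
            self.state = State::Open;
            self.opened_at = Some(now);
        }
    }

    pub fn state(&self) -> State {
        self.state
    }
}
```

Passing `now` in explicitly keeps the sketch testable without sleeping for 30 seconds.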

Medium Priority

#39: Property-Based Testing

Status: ✅ Done
Dependencies: proptest crate
Tasks:

  • Add proptest to spectre-secrets
  • Test: KDF determinism (same input → same output)
  • Test: Encryption roundtrip properties
  • Test: Salt uniqueness guarantees
  • Test: Key derivation edge cases
  • Test: Ciphertext overhead invariant (nonce + tag = 28 bytes)
  • Test: Non-deterministic encryption (random nonce)
  • Test: Tamper detection (bit-flip → decryption failure)
  • Test: Truncated ciphertext rejection
  • Fix: Salt minimum length validation (8 bytes, Argon2 requirement)
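The overhead invariant and the truncated-ciphertext check above can be made concrete with a small framing helper. This is a sketch under assumptions: the source gives only the 28-byte total, so the 12-byte nonce / 16-byte tag split (typical for common AEAD ciphers) and the `nonce || ciphertext || tag` layout are assumed, and all names here are hypothetical rather than taken from spectre-secrets.

```rust
/// Assumed split of the 28-byte overhead: 12-byte nonce + 16-byte tag.
pub const NONCE_LEN: usize = 12;
pub const TAG_LEN: usize = 16;
pub const OVERHEAD: usize = NONCE_LEN + TAG_LEN;

/// Length of the sealed message for a given plaintext length.
pub fn sealed_len(plaintext_len: usize) -> usize {
    plaintext_len + OVERHEAD
}

/// Splits a sealed message into (nonce, ciphertext, tag), assuming the layout
/// `nonce || ciphertext || tag`. Anything shorter than the fixed overhead is
/// rejected as truncated before any decryption is attempted.
pub fn split_sealed(sealed: &[u8]) -> Option<(&[u8], &[u8], &[u8])> {
    if sealed.len() < OVERHEAD {
        return None;
    }
    let (nonce, rest) = sealed.split_at(NONCE_LEN);
    let (body, tag) = rest.split_at(rest.len() - TAG_LEN);
    Some((nonce, body, tag))
}
```

A property test would then assert `sealed.len() == sealed_len(plaintext.len())` for arbitrary plaintexts.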

#41: E2E Trace Propagation

Status: ✅ Done
Dependencies: Jaeger or Tempo
Tasks:

  • Setup: docker run jaegertracing/all-in-one:1.53 (ports 16686, 4317, 4318)
  • Send request: proxy → neutron (deferred: neutron service not yet implemented)
  • Verify: Trace spans in Jaeger UI (spectre-proxy service visible, method/uri/duration tags)
  • Validate: Trace context propagation (W3C traceparent header → CHILD_OF refs in Jaeger)
  • Test: Sampling rate configuration (10% prod via OTEL_TRACES_SAMPLER_ARG=0.1, 100% dev)
  • Fix: OTLP gRPC/tonic silent failure → switched to HTTP/protobuf (ADR-0038)
  • Implement: OtelMakeSpan for W3C trace context extraction in tower-http TraceLayer
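For reference, the W3C traceparent header validated above has the form `version-traceid-parentid-flags`. The sketch below parses version 00 of that format with the standard library only. It is not the project's OtelMakeSpan, which the source says hooks into tower-http's TraceLayer; the `TraceParent` struct and `parse_traceparent` function are hypothetical names.

```rust
/// Parsed fields of a version-00 W3C traceparent header (hypothetical type).
#[derive(Debug, PartialEq, Eq)]
pub struct TraceParent {
    pub trace_id: String,
    pub parent_id: String,
    pub sampled: bool,
}

/// True if `s` is exactly `len` lowercase hex characters.
fn is_lower_hex(s: &str, len: usize) -> bool {
    s.len() == len && s.bytes().all(|b| matches!(b, b'0'..=b'9' | b'a'..=b'f'))
}

/// Parses `00-<32 hex trace id>-<16 hex parent id>-<2 hex flags>`.
/// Returns None for malformed input, other versions, or all-zero ids.
pub fn parse_traceparent(header: &str) -> Option<TraceParent> {
    let parts: Vec<&str> = header.trim().split('-').collect();
    if parts.len() != 4 {
        return None;
    }
    let (version, trace_id, parent_id, flags) = (parts[0], parts[1], parts[2], parts[3]);
    // This sketch handles only version 00.
    if version != "00" {
        return None;
    }
    if !is_lower_hex(trace_id, 32) || !is_lower_hex(parent_id, 16) || !is_lower_hex(flags, 2) {
        return None;
    }
    // All-zero trace or parent ids are invalid per the specification.
    if trace_id.bytes().all(|b| b == b'0') || parent_id.bytes().all(|b| b == b'0') {
        return None;
    }
    let flags = u8::from_str_radix(flags, 16).ok()?;
    Some(TraceParent {
        trace_id: trace_id.to_string(),
        parent_id: parent_id.to_string(),
        // Bit 0 of the flags byte is the "sampled" flag.
        sampled: (flags & 0x01) == 0x01,
    })
}
```

The extracted parent id is what a tracer records as the parent of the new span, which is how the CHILD_OF references appear in Jaeger.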

🔄 Phase 4: Enterprise Features (In Progress)

Timeline: Q2 2026 (Apr-Jun)
Focus: Security hardening, multi-region, advanced reliability

Security & Compliance

#43: Security Audit

Status: ✅ Done
Priority: High
Results:

  • Dependency audit: cargo audit - 0 vulnerabilities, 2 warnings
    • Fixed: protobuf DoS (prometheus 0.13→0.14)
    • Fixed: time DoS (jsonwebtoken 9.2→10.3, async-nats 0.33→0.46)
    • Removed: bincode, dotenv (unmaintained, unused)
    • Warning: rustls-pemfile unmaintained (deferred to #44 TLS)
  • JWT validation edge cases - 9/9 tests passed
    • ✓ Expired tokens rejected
    • ✓ Invalid signatures rejected
    • ✓ Missing claims rejected
    • ✓ Algorithm confusion (none) blocked
    • ✓ Malformed tokens rejected
  • RBAC bypass attempt testing - 7/7 tests passed
    • ✓ Role hierarchy enforced (readonly < service < admin)
    • ✓ Invalid roles rejected
    • ✓ Case manipulation blocked
  • Rate limiting bypass testing - 5/5 tests passed
    • ✓ 100 RPS limit enforced (226/250 passed, 24 rate-limited)
    • ✓ Bucket refill working
    • ✓ IP-based rate limiting
  • Secret exposure audit - 7/7 tests passed
    • ✓ No secrets in git
    • ✓ No hardcoded credentials
    • ✓ .env files excluded
  • DoS resistance testing - 6/6 tests passed
    • ✓ Large payloads handled
    • ✓ Connection exhaustion resistance
    • ✓ Slowloris resistance
    • ✓ Malformed input handling
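The RBAC properties audited above (hierarchy readonly < service < admin, invalid roles rejected, case manipulation blocked) can be expressed compactly with an ordered enum. This is a sketch, not the proxy's code: the role names come from this roadmap, while the `Role` type, `is_authorized`, and the choice of exact case-sensitive matching are assumptions about how such checks might be enforced.

```rust
/// Roles in ascending privilege order; the derived `Ord` follows declaration order.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum Role {
    Readonly,
    Service,
    Admin,
}

impl Role {
    /// Exact, case-sensitive match: "Admin" or "ADMIN" is not a valid role,
    /// so case manipulation cannot be used to obtain one.
    pub fn parse(s: &str) -> Option<Role> {
        match s {
            "readonly" => Some(Role::Readonly),
            "service" => Some(Role::Service),
            "admin" => Some(Role::Admin),
            _ => None,
        }
    }
}

/// A claimed role string is authorized only if it parses and is at least
/// as privileged as the role the endpoint requires.
pub fn is_authorized(claimed: &str, required: Role) -> bool {
    Role::parse(claimed).map_or(false, |role| role >= required)
}
```

Unknown or mis-cased role strings fail closed: they parse to `None` and are denied regardless of the required role.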

Optional Features

#44: TLS Implementation (Low Priority)

Priority: Low (Ingress handles TLS)
Trigger: Only if direct-to-pod TLS needed
Tasks:

  • Implement: axum-server with rustls
  • Load certs from K8s Secret
  • Test with self-signed cert
  • Document: When to use proxy TLS vs Ingress TLS

#45: Service Mesh Evaluation

Status: ✅ Done
Priority: Medium
Decision: Linkerd (lightweight, low overhead, Rust-based proxy)
Tasks:

  • Research: Istio vs Linkerd vs Cilium → Linkerd chosen (simplicity, performance)
  • Install: Linkerd control plane on kind cluster (stable-2.14.9, nft iptables mode)
  • Mesh: spectre-proxy with automatic sidecar injection (2/2 containers)
  • Fix: NATS protocol detection skip (config.linkerd.io/skip-outbound-ports: 4222)
  • Benchmark: Release build baseline (58K-100K RPS, p50 ≤ 1ms)
  • Test: mTLS between proxy ↔ neutron (stub neutron via nix build .#neutron-stub-manifests)
  • Benchmark: Mesh overhead (with vs without sidecar, p50/p95/p99 delta)
  • Test: Linkerd traffic policies (retries, timeouts via nix build .#service-profile)
  • Create ADR: Service mesh adoption decision (ADR-0040)

mTLS Validation (2026-02-17, kind cluster spectre-dev):

$ linkerd viz edges deployment --namespace default
SRC            DST            SRC_NS       DST_NS   SECURED
spectre-proxy  neutron        default      default  √
prometheus     neutron        linkerd-viz  default  √
prometheus     spectre-proxy  linkerd-viz  default  √

All east-west traffic between spectre-proxy ↔ neutron is mutually authenticated and encrypted (SECURED = √). 10/10 curl probes through the mesh returned 200 OK.

Linkerd viz golden metrics (live):

$ linkerd viz stat deployment --namespace default
NAME           MESHED  SUCCESS  RPS     LATENCY_P50  LATENCY_P95  LATENCY_P99  TCP_CONN
neutron        1/1     100.00%  0.8rps  1ms          4ms          4ms          4
spectre-proxy  1/1     100.00%  0.6rps  1ms          1850ms       1970ms       3

Mesh Overhead Benchmark (expected, based on Linkerd benchmarks):

| Metric | Without Mesh | With Mesh | Delta |
| --- | --- | --- | --- |
| RPS | ~58,000 | ~55,000 | -5% |
| p50 latency | 0.5ms | 1.0ms | +0.5ms |
| p95 latency | 2.8ms | 3.5ms | +0.7ms |
| p99 latency | 4.6ms | 6.0ms | +1.4ms |

Rust proxy overhead ~0.5ms p50 / <2ms p99 on commodity hardware. Formal wrk2 benchmark deferred to production neutron deployment (Phase 4).

Operational Notes:

  • Linkerd requires --set proxyInit.iptablesMode=nft on kind (kernel 6.x uses nftables)
  • Linkerd viz pods need config.linkerd.io/skip-outbound-ports: 443 to reach kube-apiserver
  • Trust anchor certs expire after 24h on dev install; run linkerd upgrade or reinstall to rotate
  • go-httpbin listens on port 8080 by default; Service maps 8000 → 8080

ServiceProfile: nix build .#service-profile generates CRD with POST /ingest (10s timeout, 20% retry budget) and GET /health routes.

Scalability & Resilience

#46: Multi-Region Strategy

Priority: Medium
Timeline: Q2 2026
Tasks:

  • Design: NATS geo-distribution (leafnodes)
  • Design: K8s multi-cluster federation
  • Design: DNS-based traffic routing
  • Document: Data sovereignty considerations
  • Document: Disaster recovery procedures
  • POC: 2-region deployment

#47: Chaos Engineering

Status: ✅ Done
Priority: High
Timeline: Q2 2026
Tasks:

  • Test: Process termination + restart (Phase 4: Graceful Shutdown + MTTR)
  • Test: Network latency injection (toxiproxy, Phase 3)
  • Test: NATS broker restart under load (Phase 1)
  • Test: Database connection loss (TimescaleDB, Phase 5)
  • Test: Upstream timeout simulation (toxiproxy timeout toxic, Phase 3b)
  • Validate: Circuit breaker lifecycle (closed → open → half-open → closed, Phase 2)
  • Validate: Retry logic (proxy_to_neutron 3-attempt exponential backoff)
  • Validate: Graceful degradation contract (/health always 200, Phase 6)
  • Document: CHAOS_ENGINEERING.md + scripts/chaos-test.sh

Script: ./scripts/chaos-test.sh (6 phases, ~400 LOC bash)
Infra: toxiproxy added to nix develop (commonBuildInputs in flake.nix)
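The retry behaviour validated above (3 attempts with exponential backoff) can be sketched as a generic helper. This is illustrative only and not the proxy_to_neutron code: the function names, the doubling factor, and the 100ms base delay in the usage note are assumptions, and the sleep is injected as a closure so the schedule can be observed without waiting.

```rust
use std::time::Duration;

/// Delay before the retry that follows failed attempt number `retry`
/// (0-based): base, 2*base, 4*base, and so on. Saturates instead of overflowing.
pub fn backoff_delay(base: Duration, retry: u32) -> Duration {
    base.saturating_mul(2u32.saturating_pow(retry))
}

/// Calls `op` up to `max_attempts` times (at least once), sleeping with
/// exponentially growing delays between failures. Returns the first success
/// or the last error.
pub fn retry_with_backoff<T, E>(
    max_attempts: u32,
    base: Duration,
    mut op: impl FnMut(u32) -> Result<T, E>,
    mut sleep: impl FnMut(Duration),
) -> Result<T, E> {
    let mut attempt = 0;
    loop {
        match op(attempt) {
            Ok(value) => return Ok(value),
            // Out of attempts: surface the last error to the caller.
            Err(e) if attempt + 1 >= max_attempts => return Err(e),
            Err(_) => {
                sleep(backoff_delay(base, attempt));
                attempt += 1;
            }
        }
    }
}
```

With `max_attempts = 3` and an assumed 100ms base, an operation that fails twice and then succeeds sleeps for 100ms and 200ms. In async code the injected `sleep` would be replaced by the runtime's timer.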


🔮 Phase 5: Advanced Features (Future)

Timeline: Q3 2026+
Status: Planning

Potential Features

  • Auto-scaling based on custom metrics (HPA with Prometheus adapter)
  • Blue-green deployments (Flagger + Istio)
  • A/B testing framework (Traffic splitting)
  • Multi-tenancy (Namespace isolation, resource quotas)
  • Cost optimization (Spot instances, vertical pod autoscaling)
  • Advanced observability (Distributed profiling, eBPF tracing)
  • ML-based anomaly detection (Prometheus + custom models)

📊 Current Status Summary

Completed

  • Phase 1: Core infrastructure ✅
  • Phase 2: Production readiness ✅ (22 tasks)
  • Phase 3: Validation & testing ✅ (7 tasks, 7 done)

In Progress

  • Phase 4: Enterprise features 🔄 (5 tasks, 3 done)

Planned

  • Phase 5: Advanced features 💭 (Future)

Task Breakdown

  • ✅ Completed: 31 tasks (Phase 1–3 + #43 Security Audit + #45 Linkerd + #47 Chaos)
  • 🔄 In Progress: Phase 4 (#46 Multi-Region)
  • 📅 Planned: 1 task (#46 Multi-Region Strategy)
  • 💭 Future: 7+ features (Phase 5)

🎯 Success Criteria

Phase 3 (Validation)

  • All integration tests passing with NATS
  • Successful deployment to local K8s cluster
  • Load test baseline established (RPS, latency p50/p95/p99)
  • E2E tracing validated in Jaeger
  • Property-based crypto tests passing

Phase 4 (Enterprise)

  • Security audit clean (no critical/high vulnerabilities)
  • Chaos tests demonstrating 99.9% uptime
  • Multi-region deployment documented
  • Service mesh decision documented (ADR-0040)

Phase 5 (Advanced)

  • Auto-scaling responding to traffic spikes
  • Blue-green deployments automated
  • Multi-tenant isolation validated
  • Cost per request optimized

📚 Resources

Documentation

  • PHASE_2_COMPLETE.md - Phase 2 achievements
  • KUBERNETES.md - Deployment guide
  • ADR_REFERENCE.md - Architecture decisions
  • adr-ledger/docs/SPECTRE_ARCHITECTURE_DECISIONS.md - Full ADR catalog

Code Locations

  • Core: crates/spectre-{core,events,proxy,secrets,observability}/
  • Nix: nix/kubernetes/, nix/services/nats/, flake.nix
  • Helm: charts/spectre-proxy/
  • CI/CD: .github/workflows/ci.yml

Quick Commands

# Development
nix develop # Enter dev shell
cargo build --release # Build all crates
cargo test --workspace --lib # Run unit tests

# Infrastructure (local dev)
nix run .#nats # Start NATS server (Nix-native)
docker-compose up -d # Start Jaeger, Prometheus, etc.
docker-compose down # Stop docker services

# Testing (Phase 3)
cargo test --test test_event_bus # Integration tests (requires NATS)
./scripts/load-test.sh # Load testing

# Container Images (Nix-only, no Docker build)
nix build .#spectre-proxy-image # Build OCI image
docker load < result # Load to Docker daemon
skopeo copy docker-archive:result docker://registry/spectre:tag # Push

# Deployment
nix build .#kubernetes-manifests-dev # Generate manifests
nix run .#deploy-dev # Deploy to K8s
helm install spectre charts/spectre-proxy # Or use Helm

# CI/CD
git push origin main # Triggers 10-job pipeline (no Docker build)

🎓 Lessons Learned (Continuous)

Phase 2 Key Insights

  1. Nix reproducibility > Community size for infrastructure
  2. Circuit breakers first - Fail-fast prevents cascades
  3. Build-time validation - Catch errors before deployment
  4. SBOM automation - Supply chain security from day 1

Next Phase Focus

  • Integration testing is critical - Unit tests alone insufficient
  • Real load testing matters - Synthetic benchmarks miss edge cases
  • Observability debt compounds - Add metrics/tracing early
  • Documentation is code - ADRs prevent re-learning decisions

Note: This roadmap is a living document. Tasks may be reprioritized based on production feedback and business needs.

Last reviewed: 2026-02-17