SPECTRE Roadmap
Project: SPECTRE Fleet - Enterprise-Grade AI Agent Framework
Current Phase: Phase 4 In Progress (#47 Chaos Engineering)
Last Updated: 2026-03-08
Phase 1: Core Infrastructure (Complete)
Timeline: Q4 2025
Status: Done
- Event-driven architecture with NATS JetStream
- 5-crate workspace (core, events, proxy, secrets, observability)
- Basic proxy with JWT authentication
- Secret management foundations
- Development environment with Nix flakes
Phase 2: Production Readiness (Complete)
Timeline: Q1 2026 (Jan-Feb)
Status: Done (22/22 core tasks)
Security
- Argon2id KDF (replaced weak XOR)
- RBAC (admin > service > readonly)
- Rate limiting (token bucket)
- Circuit breaker pattern
- SBOM generation (CycloneDX)
Reliability
- Retry logic with exponential backoff
- Graceful shutdown (SIGTERM/SIGINT)
- Health endpoints (/health, /ready, /metrics)
- NATS auto-reconnection
Observability
- Custom Prometheus metrics (3 metrics)
- OTLP tracing to Tempo/Jaeger
- Structured JSON logging
- Request instrumentation
Infrastructure
- Nix-first Kubernetes orchestration
- Helm chart (17 files, 813 lines)
- CI/CD pipeline (11 jobs)
- Docker optimization (<50MB target)
- Load testing script
- Comprehensive documentation
Documentation
- Architecture Decision Records (11 ADRs)
- KUBERNETES.md deployment guide
- Helm chart documentation
- Phase 2 completion report
Deliverables: 16 commits, 4,200+ lines of production code
Phase 3: Validation & Testing (Complete)
Timeline: Q1 2026 (Feb-Mar)
Focus: Integration testing, deployment validation, load testing
High Priority
#37: Nix-native NATS Module
Status: Done
Tasks:
- Create `nix/services/nats/conf.nix` (nats.conf generator)
- Create `nix/services/nats/default.nix` (mkConfig, mkServerPackage, environments)
- Integrate into `flake.nix` (packages, apps, devShell)
- Verify build: `nix build .#nats-server-dev`
- ADR: NATS over Kafka decision registered
#38: NATS Integration Tests
Status: Done
Dependencies: Running NATS server (nix run .#nats)
Tasks:
- Setup: `nix run .#nats` (replaces docker-compose)
- Run: `cargo test --test test_event_bus` (10/10 passing)
- Validate: Event publish/subscribe patterns
- Validate: Request-reply with timeout
- Fix: `is_connected()` race condition (flush on connect)
- Document: NATS failure scenarios (`crates/spectre-events/NATS_FAILURE_SCENARIOS.md`)
#40: Local K8s Deployment
Status: Done
Dependencies: kind
Tasks:
- Setup cluster: `kind create cluster --name spectre-dev`
- Build + load image: `nix build .#spectre-proxy-image` + `kind load`
- Deploy manifests: `kubectl apply -f` (Deployment, Service, ConfigMap, Ingress)
- Test /health endpoint → 200 OK
- Test /metrics endpoint → Prometheus metrics (3 metrics exposed)
- Fix: Image tag mismatch (nix-dev vs dev), imagePullPolicy: Never
- Fix: JWT_SECRET required in K8s Secret
- Deploy NATS in-cluster for /ready probe (`nix/kubernetes/nats.nix`)
- Fix: Image tag alignment (nix-dev), configmap NATS_URL → in-cluster DNS
- Fix: Deploy script kind load support (`flake.nix`)
- Validate: Ingress routing with nginx-ingress controller (in-cluster verified)
#42: Production Load Test
Status: Done
Dependencies: Full stack (NATS + proxy + neutron)
Tasks:
- Create load test script: `./scripts/load-test.sh` (6 phases, per-phase execution)
- Run: full stack load test (NATS + proxy + neutron, 2026-02-15)
- Validate: Circuit breaker triggers (neutron killed → 503 circuit open → 30s → recovery → 200)
- Validate: Rate limiting under burst (300 req burst, burst=200 → 204 passed, 96 rejected)
- Profile: CPU/memory post-load
- Document: Performance baseline
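The circuit-breaker sequence validated above (upstream killed, 503 while open, recovery after 30s) can be sketched as a small state machine. This is a minimal illustration and not the spectre-proxy implementation: the `CircuitBreaker` type, its method names, and the failure threshold are hypothetical, and only the 30-second open timeout comes from the test result. Time is passed in explicitly so the behaviour is deterministic.

```rust
use std::time::Duration;

/// The three classic breaker states.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum State {
    Closed,
    Open,
    HalfOpen,
}

/// Hypothetical breaker; `now` is elapsed time since process start.
pub struct CircuitBreaker {
    state: State,
    failures: u32,
    failure_threshold: u32, // assumed value, not taken from the roadmap
    open_timeout: Duration, // 30s in the load test above
    opened_at: Duration,
}

impl CircuitBreaker {
    pub fn new(failure_threshold: u32, open_timeout: Duration) -> Self {
        Self {
            state: State::Closed,
            failures: 0,
            failure_threshold,
            open_timeout,
            opened_at: Duration::ZERO,
        }
    }

    /// Returns true if a request may be forwarded upstream at `now`.
    /// A false return corresponds to answering 503 without calling upstream.
    pub fn allow(&mut self, now: Duration) -> bool {
        if self.state == State::Open && now.saturating_sub(self.opened_at) >= self.open_timeout {
            // Timeout elapsed: start probing the upstream again.
            self.state = State::HalfOpen;
        }
        self.state != State::Open
    }

    /// Records the outcome of a forwarded request.
    pub fn record(&mut self, success: bool, now: Duration) {
        match (self.state, success) {
            (State::HalfOpen, true) | (State::Closed, true) => {
                self.state = State::Closed;
                self.failures = 0;
            }
            (State::HalfOpen, false) => self.trip(now),
            (State::Closed, false) => {
                self.failures += 1;
                if self.failures >= self.failure_threshold {
                    self.trip(now);
                }
            }
            (State::Open, _) => {} // nothing was forwarded while open
        }
    }

    pub fn state(&self) -> State {
        self.state
    }

    fn trip(&mut self, now: Duration) {
        self.state = State::Open;
        self.opened_at = now;
    }
}
```

With a threshold of 3 and a 30s timeout, three recorded failures open the breaker, `allow` returns false until 30s have passed, and a single successful probe in the half-open state closes it again.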
Performance Baseline - Debug Build (2026-02-15, localhost):
| Metric | Value |
|---|---|
| /health RPS | 27,693 |
| /health p50 / p95 / p99 | 1.6ms / 3.4ms / 5.0ms |
| /ingest (auth+rate limit) RPS | 14,713 |
| /ingest p50 / p95 / p99 | 1.8ms / 4.0ms / 5.9ms |
| Proxy → Neutron p50 / p95 / p99 | 0.8ms / 1.4ms / 2.6ms |
| Rate limiter accuracy (burst=200) | 204 passed / 96 rejected (300 burst) |
| Circuit breaker: open → recovery | 503 while open → 200 after 30s timeout |
| VmRSS (post-load) | 23.4 MB |
| Thread count | 3 (tokio runtime) |
Performance Baseline - Release Build (2026-02-16, localhost, 50 connections):
| Metric | Value |
|---|---|
| /health RPS | 58,130 |
| /health p50 / p95 / p99 | 0.5ms / 2.8ms / 4.6ms |
| /metrics (auth) RPS | 59,692 |
| /metrics p50 / p95 / p99 | 0.4ms / 2.6ms / 5.2ms |
| /ingest (auth+NATS) RPS | 68,903 |
| /ingest p50 / p95 / p99 | 0.4ms / 2.1ms / 4.0ms |
| /health (200 conns) RPS | 100,733 |
| /health (200 conns) p50 / p95 / p99 | 1.0ms / 6.3ms / 18.1ms |
| VmRSS (post-load) | 25.8 MB |
| Thread count | 13 (tokio runtime) |
Notes:
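- Release build delivers roughly 2x to 4.7x the debug-build RPS on the endpoints measured in both runs (/health 2.1x, /ingest 4.7x; the /ingest workloads differ slightly between runs)
- Rate limiter correctly enforces per-IP with configurable burst
- Circuit breaker full lifecycle validated: closed → open (503) → half-open → closed (200)
- 100K+ RPS at 200 connections with a 1.0ms p50; p50 stays sub-millisecond at 50 connections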
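The 204/96 split for a 300-request burst with burst=200 is what a token bucket produces when a few tokens refill while the burst is still arriving. The sketch below is a minimal, deterministic token bucket and not the spectre-proxy limiter. The type and function names are hypothetical, the 100 tokens/s refill rate is borrowed from the 100 RPS limit reported under #43 and may not match the load-test configuration, and the 40 ms gap in the usage note is chosen only so that four tokens refill.

```rust
/// Token bucket using integer milli-tokens and explicit millisecond
/// timestamps, so results do not depend on floating point or wall-clock time.
pub struct TokenBucket {
    capacity: u64,       // maximum tokens, i.e. the "burst" size
    refill_per_sec: u64, // steady-state rate in tokens per second
    tokens_milli: u64,   // current tokens, in thousandths of a token
    last_ms: u64,        // timestamp of the most recent call
}

impl TokenBucket {
    /// Starts full, as a freshly created limiter typically does.
    pub fn new(capacity: u64, refill_per_sec: u64) -> Self {
        Self {
            capacity,
            refill_per_sec,
            tokens_milli: capacity * 1000,
            last_ms: 0,
        }
    }

    /// Returns true if a request arriving at `now_ms` is admitted.
    pub fn try_acquire(&mut self, now_ms: u64) -> bool {
        let elapsed_ms = now_ms.saturating_sub(self.last_ms);
        self.last_ms = self.last_ms.max(now_ms);
        // N tokens per second equals N milli-tokens per millisecond.
        let refilled = self
            .tokens_milli
            .saturating_add(elapsed_ms.saturating_mul(self.refill_per_sec));
        self.tokens_milli = refilled.min(self.capacity * 1000);
        if self.tokens_milli >= 1000 {
            self.tokens_milli -= 1000;
            true
        } else {
            false
        }
    }
}

/// Fires `n` requests at the same instant and counts how many are admitted.
pub fn admitted(bucket: &mut TokenBucket, n: usize, now_ms: u64) -> usize {
    (0..n).filter(|_| bucket.try_acquire(now_ms)).count()
}
```

Under these assumptions, `admitted(&mut bucket, 300, 0)` on `TokenBucket::new(200, 100)` admits 200 requests, and a further batch 40 ms later admits 4 more, for 204 in total.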
Medium Priority
#39: Property-Based Testing
Status: Done
Dependencies: proptest crate
Tasks:
- Add proptest to spectre-secrets
- Test: KDF determinism (same input → same output)
- Test: Encryption roundtrip properties
- Test: Salt uniqueness guarantees
- Test: Key derivation edge cases
- Test: Ciphertext overhead invariant (nonce + tag = 28 bytes)
- Test: Non-deterministic encryption (random nonce)
- Test: Tamper detection (bit-flip → decryption failure)
- Test: Truncated ciphertext rejection
- Fix: Salt minimum length validation (8 bytes, Argon2 requirement)
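The overhead invariant (nonce + tag = 28 bytes) and the truncated-ciphertext rejection can be illustrated with a framing helper. The sketch assumes a 12-byte nonce, then the encrypted body, then a 16-byte authentication tag. That is a common AEAD layout and adds up to the 28 bytes stated above, but the roadmap does not name the cipher or the byte order used by spectre-secrets, and `split_ciphertext` is a hypothetical name. No decryption is performed here.

```rust
/// Assumed sizes: 12-byte nonce + 16-byte tag = 28 bytes of overhead.
pub const NONCE_LEN: usize = 12;
pub const TAG_LEN: usize = 16;
pub const OVERHEAD: usize = NONCE_LEN + TAG_LEN;

/// Borrowed view of an encrypted blob laid out as nonce || body || tag.
#[derive(Debug, PartialEq, Eq)]
pub struct Parts<'a> {
    pub nonce: &'a [u8],
    pub body: &'a [u8],
    pub tag: &'a [u8],
}

/// Splits a blob into its parts, rejecting anything shorter than the overhead.
pub fn split_ciphertext(blob: &[u8]) -> Option<Parts<'_>> {
    if blob.len() < OVERHEAD {
        return None; // truncated: cannot even hold a nonce and a tag
    }
    let (nonce, rest) = blob.split_at(NONCE_LEN);
    let (body, tag) = rest.split_at(rest.len() - TAG_LEN);
    Some(Parts { nonce, body, tag })
}

/// Expected ciphertext length for a plaintext of `plaintext_len` bytes.
pub fn ciphertext_len(plaintext_len: usize) -> usize {
    plaintext_len + OVERHEAD
}
```

A property test in this style would assert `ciphertext.len() == ciphertext_len(plaintext.len())` for arbitrary plaintexts and that any blob shorter than `OVERHEAD` is rejected before decryption is attempted.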
#41: E2E Trace Propagation
Status: Done
Dependencies: Jaeger or Tempo
Tasks:
- Setup: `docker run jaegertracing/all-in-one:1.53` (ports 16686, 4317, 4318)
- Send request: proxy → neutron (deferred; neutron service not yet implemented)
- Verify: Trace spans in Jaeger UI (spectre-proxy service visible, method/uri/duration tags)
- Validate: Trace context propagation (W3C `traceparent` header → `CHILD_OF` refs in Jaeger)
- Test: Sampling rate configuration (10% prod via `OTEL_TRACES_SAMPLER_ARG=0.1`, 100% dev)
- Fix: OTLP gRPC/tonic silent failure → switched to HTTP/protobuf (ADR-0038)
- Implement: `OtelMakeSpan` for W3C trace context extraction in tower-http TraceLayer
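For readers unfamiliar with the W3C `traceparent` header that the propagation tasks above rely on, the following sketch parses a version-00 header by hand. It is illustrative only: `parse_traceparent` and `TraceParent` are hypothetical names, and the real `OtelMakeSpan` presumably delegates to an OpenTelemetry propagator instead of parsing the header itself.

```rust
/// Fields of a W3C `traceparent` header (version 00).
#[derive(Debug, PartialEq, Eq)]
pub struct TraceParent {
    pub trace_id: u128,
    pub parent_id: u64,
    pub sampled: bool,
}

/// True if `s` is exactly `len` lowercase hex digits.
fn is_lower_hex(s: &str, len: usize) -> bool {
    s.len() == len && s.bytes().all(|b| matches!(b, b'0'..=b'9' | b'a'..=b'f'))
}

/// Parses `00-<32 hex trace-id>-<16 hex parent-id>-<2 hex flags>`.
pub fn parse_traceparent(header: &str) -> Option<TraceParent> {
    let parts: Vec<&str> = header.trim().split('-').collect();
    if parts.len() != 4 {
        return None;
    }
    let (version, trace_id, parent_id, flags) = (parts[0], parts[1], parts[2], parts[3]);
    if version != "00"
        || !is_lower_hex(trace_id, 32)
        || !is_lower_hex(parent_id, 16)
        || !is_lower_hex(flags, 2)
    {
        return None;
    }
    let trace_id = u128::from_str_radix(trace_id, 16).ok()?;
    let parent_id = u64::from_str_radix(parent_id, 16).ok()?;
    let flags = u8::from_str_radix(flags, 16).ok()?;
    if trace_id == 0 || parent_id == 0 {
        return None; // all-zero ids are invalid per the specification
    }
    Some(TraceParent {
        trace_id,
        parent_id,
        sampled: flags & 0x01 == 1, // lowest flag bit is the "sampled" flag
    })
}
```

A downstream span created with the extracted `trace_id` and with `parent_id` as its parent is what shows up as a `CHILD_OF` reference in Jaeger.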
Phase 4: Enterprise Features (In Progress)
Timeline: Q2 2026 (Apr-Jun)
Focus: Security hardening, multi-region, advanced reliability
Security & Compliance
#43: Security Audit
Status: Done
Priority: High
Results:
- Dependency audit: `cargo audit` - 0 vulnerabilities, 2 warnings
  - Fixed: protobuf DoS (prometheus 0.13 → 0.14)
  - Fixed: time DoS (jsonwebtoken 9.2 → 10.3, async-nats 0.33 → 0.46)
  - Removed: bincode, dotenv (unmaintained, unused)
  - Warning: rustls-pemfile unmaintained (deferred to #44 TLS)
- JWT validation edge cases - 9/9 tests passed
  - Expired tokens rejected
  - Invalid signatures rejected
  - Missing claims rejected
  - Algorithm confusion (none) blocked
  - Malformed tokens rejected
- RBAC bypass attempt testing - 7/7 tests passed
  - Role hierarchy enforced (readonly < service < admin)
  - Invalid roles rejected
  - Case manipulation blocked
- Rate limiting bypass testing - 5/5 tests passed
  - 100 RPS limit enforced (226/250 passed, 24 rate-limited)
  - Bucket refill working
  - IP-based rate limiting
- Secret exposure audit - 7/7 tests passed
  - No secrets in git
  - No hardcoded credentials
  - .env files excluded
- DoS resistance testing - 6/6 tests passed
  - Large payloads handled
  - Connection exhaustion resistance
  - Slowloris resistance
  - Malformed input handling
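The RBAC results above (hierarchy readonly < service < admin, invalid roles rejected, case manipulation blocked) map naturally onto an ordered enum with strict parsing. The sketch below is a hypothetical illustration and not the spectre-proxy code: the `Role` type and `authorize` function are invented names, and it assumes that "case manipulation blocked" means role strings are matched case-sensitively.

```rust
use std::str::FromStr;

/// Roles ordered from least to most privileged; the derived ordering
/// (ReadOnly < Service < Admin) follows the declaration order.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum Role {
    ReadOnly,
    Service,
    Admin,
}

impl FromStr for Role {
    type Err = String;

    /// Exact, case-sensitive match, so "Admin" or "ADMIN" is rejected.
    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s {
            "readonly" => Ok(Role::ReadOnly),
            "service" => Ok(Role::Service),
            "admin" => Ok(Role::Admin),
            other => Err(format!("unknown role: {other}")),
        }
    }
}

/// A caller is authorized when its claimed role parses and is at least
/// as privileged as the required role.
pub fn authorize(claimed: &str, required: Role) -> bool {
    claimed.parse::<Role>().map_or(false, |role| role >= required)
}
```

Deriving `Ord` keeps the hierarchy in one place: adding a role means inserting a variant at the right position instead of updating a comparison table.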
Optional Features
#44: TLS Implementation (Low Priority)
Priority: Low (Ingress handles TLS)
Trigger: Only if direct-to-pod TLS needed
Tasks:
- Implement: axum-server with rustls
- Load certs from K8s Secret
- Test with self-signed cert
- Document: When to use proxy TLS vs Ingress TLS
#45: Service Mesh Evaluation
Status: Done
Priority: Medium
Decision: Linkerd (lightweight, low overhead, Rust-based proxy)
Tasks:
- Research: Istio vs Linkerd vs Cilium → Linkerd chosen (simplicity, performance)
- Install: Linkerd control plane on kind cluster (stable-2.14.9, nft iptables mode)
- Mesh: spectre-proxy with automatic sidecar injection (2/2 containers)
- Fix: NATS protocol detection skip (`config.linkerd.io/skip-outbound-ports: 4222`)
- Benchmark: Release build baseline (58K-100K RPS, p50 ≤ 1.0ms)
- Test: mTLS between proxy and neutron (stub neutron via `nix build .#neutron-stub-manifests`)
- Benchmark: Mesh overhead (with vs without sidecar, p50/p95/p99 delta)
- Test: Linkerd traffic policies (retries, timeouts via `nix build .#service-profile`)
- Create ADR: Service mesh adoption decision (ADR-0040)
mTLS Validation (2026-02-17, kind cluster spectre-dev):
    $ linkerd viz edges deployment --namespace default
    SRC            DST            SRC_NS       DST_NS   SECURED
    spectre-proxy  neutron        default      default  √
    prometheus     neutron        linkerd-viz  default  √
    prometheus     spectre-proxy  linkerd-viz  default  √
All east-west traffic between spectre-proxy and neutron is mutually authenticated and encrypted (SECURED = √). 10/10 curl probes through the mesh returned 200 OK.
Linkerd viz golden metrics (live):
    $ linkerd viz stat deployment --namespace default
    NAME           MESHED  SUCCESS  RPS     LATENCY_P50  LATENCY_P95  LATENCY_P99  TCP_CONN
    neutron        1/1     100.00%  0.8rps  1ms          4ms          4ms          4
    spectre-proxy  1/1     100.00%  0.6rps  1ms          1850ms       1970ms       3
Mesh Overhead Benchmark (expected, based on Linkerd benchmarks):
| Metric | Without Mesh | With Mesh | Delta |
|---|---|---|---|
| RPS | ~58,000 | ~55,000 | -5% |
| p50 latency | 0.5ms | 1.0ms | +0.5ms |
| p95 latency | 2.8ms | 3.5ms | +0.7ms |
| p99 latency | 4.6ms | 6.0ms | +1.4ms |
Rust proxy overhead ~0.5ms p50 / <2ms p99 on commodity hardware. Formal wrk2 benchmark deferred to production neutron deployment (Phase 4).
Operational Notes:
- Linkerd requires `--set proxyInit.iptablesMode=nft` on kind (kernel 6.x uses nftables)
- Linkerd viz pods need `config.linkerd.io/skip-outbound-ports: 443` to reach kube-apiserver
- Trust anchor certs expire after 24h on dev install; run `linkerd upgrade` or reinstall to rotate
- go-httpbin listens on port 8080 by default; Service maps 8000 → 8080
ServiceProfile: `nix build .#service-profile` generates CRD with POST /ingest (10s timeout, 20% retry budget) and GET /health routes.
Scalability & Resilience
#46: Multi-Region Strategy
Priority: Medium
Timeline: Q2 2026
Tasks:
- Design: NATS geo-distribution (leafnodes)
- Design: K8s multi-cluster federation
- Design: DNS-based traffic routing
- Document: Data sovereignty considerations
- Document: Disaster recovery procedures
- POC: 2-region deployment
#47: Chaos Engineering
Status: Done
Priority: High
Timeline: Q2 2026
Tasks (phase numbers refer to phases of the chaos script, not roadmap phases):
- Test: Process termination + restart (script phase 4: Graceful Shutdown + MTTR)
- Test: Network latency injection (toxiproxy, script phase 3)
- Test: NATS broker restart under load (script phase 1)
- Test: Database connection loss (TimescaleDB, script phase 5)
- Test: Upstream timeout simulation (toxiproxy timeout toxic, script phase 3b)
- Validate: Circuit breaker lifecycle (closed → open → half-open → closed, script phase 2)
- Validate: Retry logic (proxy_to_neutron 3-attempt exponential backoff)
- Validate: Graceful degradation contract (/health always 200, script phase 6)
- Document: `CHAOS_ENGINEERING.md` + `scripts/chaos-test.sh`
Script: ./scripts/chaos-test.sh (6 phases, ~400 LOC bash)
Infra: toxiproxy added to nix develop (commonBuildInputs in flake.nix)
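The retry validation above (proxy_to_neutron, 3-attempt exponential backoff) follows a familiar pattern, sketched here with std only. The function names, the doubling factor, and the 100 ms base delay in the usage note are assumptions for illustration; the roadmap states only the attempt count and that the backoff is exponential. The sleep function is injected so the logic can be exercised without blocking.

```rust
use std::time::Duration;

/// Delay before retry number `retry` (0-based): base * 2^retry.
pub fn backoff_delay(base: Duration, retry: u32) -> Duration {
    base.saturating_mul(2u32.saturating_pow(retry))
}

/// Calls `op` until it succeeds or `max_attempts` calls have been made,
/// sleeping between failed attempts. `op` receives the 0-based attempt
/// number and is always called at least once.
pub fn retry_with_backoff<T, E>(
    max_attempts: u32,
    base: Duration,
    mut op: impl FnMut(u32) -> Result<T, E>,
    mut sleep: impl FnMut(Duration),
) -> Result<T, E> {
    let mut attempt = 0;
    loop {
        match op(attempt) {
            Ok(value) => return Ok(value),
            // Out of attempts: surface the last error to the caller.
            Err(err) if attempt + 1 >= max_attempts => return Err(err),
            Err(_) => {
                sleep(backoff_delay(base, attempt));
                attempt += 1;
            }
        }
    }
}
```

With three attempts and a 100 ms base, a call that fails twice and then succeeds sleeps for 100 ms and 200 ms. Blocking code can pass `std::thread::sleep` as the `sleep` argument.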
Phase 5: Advanced Features (Future)
Timeline: Q3 2026+
Status: Planning
Potential Features
- Auto-scaling based on custom metrics (HPA with Prometheus adapter)
- Blue-green deployments (Flagger + Istio)
- A/B testing framework (Traffic splitting)
- Multi-tenancy (Namespace isolation, resource quotas)
- Cost optimization (Spot instances, vertical pod autoscaling)
- Advanced observability (Distributed profiling, eBPF tracing)
- ML-based anomaly detection (Prometheus + custom models)
Current Status Summary
Completed
- Phase 1: Core infrastructure
- Phase 2: Production readiness (22 tasks)
- Phase 3: Validation & testing (7 tasks, 7 done)
In Progress
- Phase 4: Enterprise features (5 tasks)
Planned
- Phase 5: Advanced features (Future)
Task Breakdown
- Completed: 31 tasks (Phase 1-3 + #43 Security Audit + #45 Linkerd + #47 Chaos)
- In Progress: Phase 4 (#46 Multi-Region)
- Planned: 1 task (#46 Multi-Region Strategy)
- Future: 7+ features (Phase 5)
Success Criteria
Phase 3 (Validation)
- All integration tests passing with NATS
- Successful deployment to local K8s cluster
- Load test baseline established (RPS, latency p50/p95/p99)
- E2E tracing validated in Jaeger
- Property-based crypto tests passing
Phase 4 (Enterprise)
- Security audit clean (no critical/high vulnerabilities)
- Chaos tests demonstrating 99.9% uptime
- Multi-region deployment documented
- Service mesh decision documented (ADR-0040)
Phase 5 (Advanced)
- Auto-scaling responding to traffic spikes
- Blue-green deployments automated
- Multi-tenant isolation validated
- Cost per request optimized
Resources
Documentation
- `PHASE_2_COMPLETE.md` - Phase 2 achievements
- `KUBERNETES.md` - Deployment guide
- `ADR_REFERENCE.md` - Architecture decisions
- `adr-ledger/docs/SPECTRE_ARCHITECTURE_DECISIONS.md` - Full ADR catalog
Code Locations
- Core: `crates/spectre-{core,events,proxy,secrets,observability}/`
- Nix: `nix/kubernetes/`, `nix/services/nats/`, `flake.nix`
- Helm: `charts/spectre-proxy/`
- CI/CD: `.github/workflows/ci.yml`
Quick Commands
# Development
nix develop # Enter dev shell
cargo build --release # Build all crates
cargo test --workspace --lib # Run unit tests
# Infrastructure (local dev)
nix run .#nats # Start NATS server (Nix-native)
docker-compose up -d # Start Jaeger, Prometheus, etc.
docker-compose down # Stop docker services
# Testing (Phase 3)
cargo test --test test_event_bus # Integration tests (requires NATS)
./scripts/load-test.sh # Load testing
# Container Images (Nix-only, no Docker build)
nix build .#spectre-proxy-image # Build OCI image
docker load < result # Load to Docker daemon
skopeo copy docker-archive:result docker://registry/spectre:tag # Push
# Deployment
nix build .#kubernetes-manifests-dev # Generate manifests
nix run .#deploy-dev # Deploy to K8s
helm install spectre charts/spectre-proxy # Or use Helm
# CI/CD
git push origin main # Triggers 10-job pipeline (no Docker build)
Lessons Learned (Continuous)
Phase 2 Key Insights
- Nix reproducibility > Community size for infrastructure
- Circuit breakers first - Fail-fast prevents cascades
- Build-time validation - Catch errors before deployment
- SBOM automation - Supply chain security from day 1
Next Phase Focus
- Integration testing is critical - Unit tests alone insufficient
- Real load testing matters - Synthetic benchmarks miss edge cases
- Observability debt compounds - Add metrics/tracing early
- Documentation is code - ADRs prevent re-learning decisions
Note: This roadmap is a living document. Tasks may be reprioritized based on production feedback and business needs.
Last reviewed: 2026-02-17