README

PHANTOM

Local-first AI document intelligence. No cloud required. No excuses.


Phantom is a production-grade document intelligence engine that classifies, sanitizes, and understands unstructured data locally, privately, and fast.

It's not a wrapper around an API. It's not a demo. It runs entirely on your hardware, speaks to local LLMs via llama.cpp, indexes your documents into FAISS, and answers questions about them through a hybrid RAG pipeline. The only dependency is Nix.

What it does in one sentence: feed it documents, get back structured intelligence (themes, patterns, PII reports, vector search, and RAG-powered chat) without a single byte leaving your machine.


What's Inside​

phantom/
├── CORTEX Engine      - semantic chunking, LLM classification, VRAM-aware
├── RAG Pipeline       - FAISS + BM25 hybrid search with RRF fusion
├── FastAPI Server     - 20 endpoints, SSE streaming, Prometheus metrics
├── DAG Pipeline       - file classification, PII detection, sanitization
├── IntelAgent (Rust)  - 8-crate workspace: governance, security, memory, MCP
├── Cortex Desktop     - Tauri 2 + SvelteKit GUI
└── CLI                - Typer-based, scriptable, composable

Quickstart​

You need Nix. That's it.

git clone https://github.com/kernelcore/phantom
cd phantom

# Drop into the fully-pinned dev environment
nix develop

# Run the test suite to confirm everything works
just test

# Start the API server
just serve

# Or run the full desktop app
just desktop

No pip install. No virtualenv. No "works on my machine." The environment is hermetic and reproducible: today, in six months, on any machine.


Core Capabilities​

CORTEX Engine​

The heart of Phantom. Processes raw documents into structured insights through a multi-stage pipeline:

Document → SemanticChunker → EmbeddingGenerator → LLM Classifier → Pydantic Schema

  • Chunking with configurable token budgets (default: 1024 tokens, 128 overlap)
  • Parallel LLM calls with retry logic (3 attempts, 2s backoff)
  • Real-time VRAM monitoring with auto-throttle, so it won't OOM your GPU
  • Extracts: Theme, Pattern, Learning, Concept, Recommendation

# Process a document directory
just run extract --input ./docs --output ./insights.json

# Or hit the API directly
curl -X POST http://localhost:8000/process \
-F "file=@report.pdf" \
-F "chunk_strategy=recursive" \
-F "chunk_size=1024"
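
The chunking and retry defaults listed above can be sketched in a few lines. This is a minimal illustration, not Phantom's SemanticChunker: `chunk_tokens` and `with_retry` are hypothetical names, tokens are assumed to be pre-split strings (the real tokenizer depends on the model), and the 2s backoff is assumed to be a fixed delay rather than exponential.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def chunk_tokens(tokens: list[str], size: int = 1024, overlap: int = 128) -> list[list[str]]:
    """Split tokens into windows of at most `size`; consecutive windows share `overlap` tokens."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks: list[list[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the last window already reaches the end of the document
    return chunks


def with_retry(call: Callable[[], T], attempts: int = 3, backoff_s: float = 2.0,
               sleep: Callable[[float], None] = time.sleep) -> T:
    """Run `call`, retrying on any exception with a fixed delay between attempts."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return call()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(backoff_s)


tokens = [f"t{i}" for i in range(2500)]
print([len(chunk) for chunk in chunk_tokens(tokens)])  # [1024, 1024, 708]
```

With the defaults, each window starts 896 tokens after the previous one, so a 2500-token document yields three chunks. The `sleep` parameter exists only so the delay can be swapped out in tests.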

Hybrid Search

Most RAG systems pick either semantic or keyword search. Phantom does both and fuses the results using Reciprocal Rank Fusion:

Query → FAISS (dense cosine) ─┐
                              ├→ RRF Fusion → Ranked Results
Query → BM25Okapi (sparse) ───┘

  • FAISS IndexFlatIP over L2-normalized vectors, so inner product equals cosine similarity
  • Optional GPU acceleration via StandardGpuResources
  • BM25 index rebuilt lazily on each add(), so no manual sync is required

curl -X POST http://localhost:8000/vectors/search \
-H "Content-Type: application/json" \
-d '{"query": "compliance requirements", "top_k": 5, "search_type": "hybrid"}'
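
For readers who haven't met Reciprocal Rank Fusion: each document scores the sum of 1 / (k + rank) over every ranking it appears in. The sketch below shows the idea under stated assumptions. `rrf_fuse` is a hypothetical helper, the document IDs are made up, and k = 60 is the commonly used constant, not necessarily the value Phantom ships with.

```python
from collections import defaultdict


def rrf_fuse(rankings: list[list[str]], k: int = 60, top_k: int = 5) -> list[tuple[str, float]]:
    """Fuse several best-first rankings of document IDs with Reciprocal Rank Fusion."""
    scores: defaultdict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID so the output is stable.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))[:top_k]


dense = ["doc_a", "doc_b", "doc_c"]   # e.g. FAISS order (illustrative IDs)
sparse = ["doc_b", "doc_d", "doc_a"]  # e.g. BM25 order (illustrative IDs)
fused = rrf_fuse([dense, sparse])
print([doc_id for doc_id, _ in fused])  # ['doc_b', 'doc_a', 'doc_d', 'doc_c']
```

RRF only looks at ranks, never raw scores, which is why cosine similarities and BM25 scores can be combined without normalizing either one.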

RAG Chat with Streaming​

Context-aware chat over your document base. Supports SSE streaming for real-time token delivery.

# Streaming chat
curl -X POST http://localhost:8000/api/chat/stream \
-H "Content-Type: application/json" \
-d '{
"message": "What are the key risks in the Q3 report?",
"conversation_id": "session-001",
"history": [],
"context_size": 5
}'
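
On the client side, an SSE response is a stream of `data:` lines with a blank line closing each event. The parser below is a minimal sketch of that framing, following the general SSE rules rather than anything Phantom-specific: `iter_sse_data` is a hypothetical helper, and treating each event payload as plain text is an assumption (the server may well send JSON).

```python
from typing import Iterable, Iterator


def iter_sse_data(lines: Iterable[str]) -> Iterator[str]:
    """Yield the data payload of each server-sent event found in `lines`."""
    buffer: list[str] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line dispatches the event collected so far.
            if buffer:
                yield "\n".join(buffer)
                buffer = []
            continue
        if line.startswith(":"):
            continue  # comment line, often used as a keep-alive
        field, _, value = line.partition(":")
        if field == "data":
            # A single leading space after the colon is not part of the value.
            buffer.append(value[1:] if value.startswith(" ") else value)
    # An event not terminated by a blank line is dropped, as the SSE rules require.


sample = ["data: The\n", "\n", ": keep-alive\n", "data: key risks\n", "data: are...\n", "\n"]
print(list(iter_sse_data(sample)))  # ['The', 'key risks\nare...']
```

Feed it the decoded lines of the HTTP response body from any client that exposes line iteration.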

The LLM provider is fully abstracted. Default is llama.cpp over HTTP (OpenAI-compatible). OpenAI, Anthropic, and DeepSeek slots exist in the provider registry; they're just not wired yet. That's intentional: we're not building cloud lock-in.

Data Sanitization Pipeline​

Phantom's DAG pipeline processes files through a classification and sanitization chain before they ever touch your vector store:

Discovery → Fingerprint → Classify → Pseudonymize → Sanitize → Verify → Persist

Four sanitization levels:

| Level | What happens |
| --- | --- |
| none | Direct copy, no modifications |
| strip_metadata | EXIF, document properties, author fields removed |
| redact_pii | Email, phone, SSN, CPF/CNPJ, credit cards replaced with [REDACTED] |
| full_sanitize | Everything above + content normalization |

PII detection covers: email addresses, phone numbers, SSN, CPF/CNPJ, payment card numbers, AWS credentials, API keys, Bearer tokens, private keys, PGP blocks, IPv4/IPv6 ranges, UUIDs.
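
To make the redact_pii level concrete, here is a rough sketch of regex-based redaction. The three patterns are simplified illustrations, not Phantom's actual detectors, and `PII_PATTERNS` and `redact_pii` are hypothetical names.

```python
import re

# Simplified, illustrative patterns; production detectors need far more care.
PII_PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "cpf": re.compile(r"\b\d{3}\.\d{3}\.\d{3}-\d{2}\b"),
}


def redact_pii(text: str, placeholder: str = "[REDACTED]") -> tuple[str, dict[str, int]]:
    """Replace every match of each pattern and report how many hits each label had."""
    counts: dict[str, int] = {}
    for label, pattern in PII_PATTERNS.items():
        text, hits = pattern.subn(placeholder, text)
        counts[label] = hits
    return text, counts


clean, counts = redact_pii("Contact ana@example.com, SSN 123-45-6789, CPF 123.456.789-09.")
print(clean)   # Contact [REDACTED], SSN [REDACTED], CPF [REDACTED].
print(counts)  # {'email': 1, 'ssn': 1, 'cpf': 1}
```

Returning the per-label counts alongside the text is one way to feed a PII report without keeping the sensitive values around.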

# Scan a directory for sensitive content
phantom-scan ./repo | jq '.findings[] | select(.risk_score > 0.7)'

# Sanitize before exporting
phantom-dag -i ./internal_dataset -o ./export --sanitize pii

# Dry-run to preview what would happen
phantom -i ./input -o ./output --dry-run

Cryptographic Integrity​

Every file processed gets a hash. You choose the algorithm:

| Algorithm | Use case |
| --- | --- |
| SHA256 | Baseline integrity, broad compatibility |
| BLAKE3 | High-throughput, modern standard |
| xxHash | Maximum speed, block-level streaming |

# Generate a manifest
phantom-hash ./directory > manifest.json

# Verify a file against a known hash
phantom-verify report.pdf abc123def456...

# Diff two manifests (transfer verification)
diff <(jq -S . before.json) <(jq -S . after.json)
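
If you want to see what a manifest boils down to, the sketch below builds one with the standard library. It covers SHA256 only (BLAKE3 and xxHash need third-party packages), and the flat path-to-digest layout is an assumption, not the actual phantom-hash output format. `hash_file` and `build_manifest` are hypothetical names.

```python
import hashlib
from pathlib import Path


def hash_file(path: Path, algorithm: str = "sha256", block_size: int = 1 << 20) -> str:
    """Hash a file in fixed-size blocks so large files never sit fully in memory."""
    digest = hashlib.new(algorithm)
    with path.open("rb") as handle:
        for block in iter(lambda: handle.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict[str, str]:
    """Map every file under `root` (by relative POSIX path) to its hex digest."""
    return {
        path.relative_to(root).as_posix(): hash_file(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }
```

Calling `build_manifest(Path("./directory"))` and dumping the result with `json.dumps(..., sort_keys=True)` gives output that diffs cleanly, which is the same reason the workflow above pipes manifests through `jq -S`.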

API Reference​

The FastAPI server runs at http://localhost:8000 by default. Prometheus metrics at /metrics, OpenAPI docs at /docs.

| Endpoint | Method | Purpose |
| --- | --- | --- |
| /health | GET | Liveness probe |
| /ready | GET | Readiness check with downstream deps |
| /metrics | GET | Prometheus metrics |
| /api/system/metrics | GET | CPU, RAM, VRAM, disk |
| /process | POST | Process document with CORTEX |
| /extract | POST | Extract insights from text |
| /upload | POST | Single file upload |
| /api/upload | POST | Multi-file upload with processing |
| /vectors/search | POST | Hybrid vector search |
| /vectors/index | POST | Index document to FAISS |
| /vectors/batch-index | POST | Batch indexing |
| /api/chat | POST | RAG-powered chat |
| /api/chat/stream | POST | SSE streaming chat |
| /api/models | GET | List available LLM models |
| /api/prompt/test | POST | Render and token-count a prompt |
| /api/pipeline | POST | Full DAG pipeline execution |
| /api/pipeline/scan | POST | Scan-only (read-only, no writes) |
| /judge | POST | AI-Agent-OS judgment integration |

All request/response bodies are validated by Pydantic v2. No silent failures.


Output Structure​

output/
├── documents/      # PDF, DOCX, TXT, MD
├── images/         # PNG, JPG, SVG
├── audio/          # MP3, FLAC, WAV
├── video/          # MP4, MKV, AVI
├── code/           # PY, JS, RS, GO, NIX
├── data/           # JSON, CSV, PARQUET
├── archives/       # ZIP, TAR, 7Z
├── configs/        # ENV, CONF, INI
├── logs/           # LOG, OUT, ERR
├── crypto/         # PEM, KEY, P12
├── executables/    # ELF, EXE, DEB
├── unknown/        # Unclassified
└── .phantom/
    ├── phantom.db           # SQLite audit log
    ├── pseudonym_map.json   # Reversible path mapping
    ├── reports/             # JSON execution reports
    ├── audit/               # Chain of custody
    ├── staging/             # Processing scratch space
    └── quarantine/          # Files that failed validation

Execution Report​

{
  "phantom_version": "0.0.1",
  "statistics": {
    "total_files": 15420,
    "processed": 15398,
    "failed": 22,
    "success_rate": "99.86%",
    "total_size_human": "48.32 GB",
    "duration_seconds": "127.45",
    "throughput_files_per_sec": "120.81",
    "files_with_sensitive_data": 847
  },
  "sensitivity_breakdown": {
    "PUBLIC": 12453,
    "INTERNAL": 1892,
    "CONFIDENTIAL": 734,
    "SECRET": 289,
    "TOP_SECRET": 30
  }
}
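
The figures in a report like this can be cross-checked in a few lines. The sketch below recomputes the success rate from the sample above and verifies that the counts add up. `check_report` is a hypothetical helper, and the rule that the sensitivity breakdown sums to the processed count is an assumption that happens to hold for this sample.

```python
report = {
    "statistics": {"total_files": 15420, "processed": 15398, "failed": 22},
    "sensitivity_breakdown": {
        "PUBLIC": 12453, "INTERNAL": 1892, "CONFIDENTIAL": 734, "SECRET": 289, "TOP_SECRET": 30,
    },
}


def check_report(report: dict) -> str:
    """Validate the counts and return the success rate formatted like the report does."""
    stats = report["statistics"]
    if stats["processed"] + stats["failed"] != stats["total_files"]:
        raise ValueError("processed + failed does not match total_files")
    if sum(report["sensitivity_breakdown"].values()) != stats["processed"]:
        raise ValueError("sensitivity breakdown does not match processed count")
    return f"{100 * stats['processed'] / stats['total_files']:.2f}%"


print(check_report(report))  # 99.86%
```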

Path Pseudonymization​

Original paths are replaced with reversible pseudonyms built around a deterministic path hash. Nothing is lost: the mapping is persisted in pseudonym_map.json.

/home/user/docs/secret_report_2024.pdf
        ↓
PH-a1b2c3d4-e5f6a7b8-1234abcd.pdf
│  │        │        │
│  │        │        └─ Hexadecimal timestamp
│  │        └─ Random entropy block
│  └─ Deterministic path hash
└─ Namespace prefix

# Resolve it back
phantom --resolve PH-a1b2c3d4-e5f6a7b8-1234abcd.pdf
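
The naming scheme above can be approximated as follows. This is a sketch under loud assumptions: the hash function (truncated SHA256), the 4-byte entropy block, and the seconds-resolution timestamp are guesses that merely reproduce the shape of the example, and `make_pseudonym`, `pseudonymize`, and `resolve` are hypothetical names.

```python
import hashlib
import secrets
import time
from pathlib import PurePosixPath
from typing import Optional


def make_pseudonym(path: str, *, now: Optional[float] = None, entropy: Optional[str] = None) -> str:
    """Build PH-<path hash>-<entropy>-<hex timestamp><original extension>."""
    suffix = PurePosixPath(path).suffix
    path_hash = hashlib.sha256(path.encode("utf-8")).hexdigest()[:8]   # deterministic part
    entropy_hex = entropy if entropy is not None else secrets.token_hex(4)  # random part
    stamp = format(int(now if now is not None else time.time()), "08x")
    return f"PH-{path_hash}-{entropy_hex}-{stamp}{suffix}"


def pseudonymize(path: str, mapping: dict[str, str]) -> str:
    """Create a pseudonym and record it so the original path can be recovered."""
    alias = make_pseudonym(path)
    mapping[alias] = path
    return alias


def resolve(alias: str, mapping: dict[str, str]) -> str:
    """Look the original path back up; reversibility comes from the map, not the name."""
    return mapping[alias]


mapping: dict[str, str] = {}
alias = pseudonymize("/home/user/docs/secret_report_2024.pdf", mapping)
print(alias, "->", resolve(alias, mapping))
```

Only the first hex block is reproducible from the path. The entropy and timestamp blocks differ on every run, which is why the persisted map is what makes resolution possible.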

IntelAgent (Rust)​

A separate Rust workspace living inside Phantom's repo. Eight crates, each with a defined responsibility:

| Crate | Purpose |
| --- | --- |
| intelagent-core | Shared types, traits, runtime primitives |
| intelagent-mcp | MCP protocol implementation (agent-to-agent comms) |
| intelagent-memory | Context windows, knowledge graphs |
| intelagent-quality | Automated peer review gates |
| intelagent-security | Privacy auditing, ed25519 signing, blake3 hashing |
| intelagent-governance | DAO-style rules, reward mechanisms |
| intelagent-cli | Command-line interface for the agent |
| intelagent-soc | GTK4 Security Operations Center UI |

Built with opt-level=3 + LTO + strip=true. Production binaries, not dev toys.

# Build all Rust crates
nix build .#intelagent

# Run Rust tests (nextest parallel runner)
nix flake check

Testing​

Three levels. No compromises.

# Everything
just test

# Targeted
just test-unit
just test-integration
just test-e2e

# With coverage report (enforced minimum: 70%)
just test-cov

# GPU-specific tests
just test-gpu

# Match a pattern
just test-match "test_vector"

tests/
├── conftest.py       # Shared fixtures
├── test_imports.py   # Critical import smoke tests
├── unit/             # 17 test modules (isolated, fast)
├── integration/      # API + CLI tests (requires running server)
└── e2e/              # Full pipeline tests (slow, thorough)

Coverage is enforced at 70% minimum via pytest --cov-fail-under=70 (pytest-cov). The CI will fail before you merge something that regresses it.


Extending Phantom​

Add a sensitivity pattern​

# src/phantom/pipeline/phantom_dag.py
SENSITIVE_PATTERNS = [
    # (regex, label, risk_score)
    (r'your_pattern', 'YOUR_LABEL', 0.9),
]
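
To show how a (regex, label, risk_score) tuple turns into a finding, here is a minimal scanner. It is a sketch, not the code in phantom_dag.py: `scan_text` is a hypothetical function, the two patterns and their scores are illustrative, and the finding fields other than risk_score are assumptions.

```python
import re

SENSITIVE_PATTERNS = [
    # (regex, label, risk_score) with illustrative values
    (r"AKIA[0-9A-Z]{16}", "AWS_ACCESS_KEY_ID", 0.9),
    (r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b", "EMAIL", 0.5),
]


def scan_text(text: str, patterns=SENSITIVE_PATTERNS) -> list[dict]:
    """Return one finding per match; the matched value itself is deliberately not stored."""
    findings = []
    for regex, label, risk_score in patterns:
        for match in re.finditer(regex, text):
            findings.append({
                "label": label,
                "risk_score": risk_score,
                "start": match.start(),
                "end": match.end(),
            })
    return findings


findings = scan_text("key=AKIAIOSFODNN7EXAMPLE mail=bob@example.org")
high_risk = [finding for finding in findings if finding["risk_score"] > 0.7]
print([finding["label"] for finding in high_risk])  # ['AWS_ACCESS_KEY_ID']
```

The final filter is the Python equivalent of the `select(.risk_score > 0.7)` step used with phantom-scan elsewhere in this README.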

Add a file type​

EXT_MAP = {
    '.yourext': Classification.DOCUMENTS,
}
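
A lookup against such a map might work like the sketch below. Only Classification.DOCUMENTS comes from the snippet above; the other enum members, the sample extensions, and `classify_by_extension` are hypothetical, and the real pipeline also inspects magic bytes, which this ignores.

```python
from enum import Enum
from pathlib import PurePath


class Classification(Enum):
    # Stand-in enum; member names other than DOCUMENTS are assumptions.
    DOCUMENTS = "documents"
    CODE = "code"
    UNKNOWN = "unknown"


EXT_MAP = {
    ".pdf": Classification.DOCUMENTS,
    ".md": Classification.DOCUMENTS,
    ".py": Classification.CODE,
    ".rs": Classification.CODE,
}


def classify_by_extension(path: str) -> Classification:
    """Case-insensitive extension lookup with an UNKNOWN fallback."""
    return EXT_MAP.get(PurePath(path).suffix.lower(), Classification.UNKNOWN)


print(classify_by_extension("Q3/Report.PDF").value)  # documents
print(classify_by_extension("blob.bin").value)       # unknown
```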

Add an LLM provider​

Implement AIProvider from src/phantom/providers/base.py:

from typing import AsyncIterator

class YourProvider(AIProvider):
    async def generate(self, prompt: str, **kwargs) -> GenerationResult: ...
    async def stream(self, prompt: str, **kwargs) -> AsyncIterator[str]: ...
    async def health_check(self) -> ProviderStatus: ...

Register it in the API's provider resolver. Done.
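
For a feel of what a complete provider looks like, here is a self-contained toy. Everything except the three method signatures is an assumption: the `AIProvider`, `GenerationResult`, and `ProviderStatus` classes below are stand-ins so the example runs without Phantom installed, and `EchoProvider` is a made-up provider that just echoes its prompt.

```python
import asyncio
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import AsyncIterator


@dataclass
class GenerationResult:  # stand-in; the real fields are not shown in this README
    text: str


@dataclass
class ProviderStatus:  # stand-in; the real fields are not shown in this README
    healthy: bool


class AIProvider(ABC):  # stand-in mirroring the three methods listed above
    @abstractmethod
    async def generate(self, prompt: str, **kwargs) -> GenerationResult: ...

    @abstractmethod
    async def stream(self, prompt: str, **kwargs) -> AsyncIterator[str]: ...

    @abstractmethod
    async def health_check(self) -> ProviderStatus: ...


class EchoProvider(AIProvider):
    async def generate(self, prompt: str, **kwargs) -> GenerationResult:
        return GenerationResult(text=prompt.upper())

    async def stream(self, prompt: str, **kwargs) -> AsyncIterator[str]:
        # Written as an async generator: callers use `async for`, without awaiting first.
        for word in prompt.split():
            yield word

    async def health_check(self) -> ProviderStatus:
        return ProviderStatus(healthy=True)


async def demo() -> list[str]:
    provider = EchoProvider()
    return [token async for token in provider.stream("hello local llm")]


tokens = asyncio.run(demo())
print(tokens)  # ['hello', 'local', 'llm']
```

Whether the real interface expects `stream` to be an async generator or a coroutine returning an iterator is not stated here, so check base.py before copying this shape.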


Known Constraints​

  • Files over 10MB are skipped during deep PII scanning (magic bytes + extension classification still applies).
  • Encrypted file content cannot be classified beyond magic bytes and extension.
  • Metadata stripping is best-effort on proprietary formats; some residual metadata may survive.

These are documented tradeoffs, not bugs.


Common Workflows​

Normalize legacy storage:

phantom -i /mnt/legacy -o /mnt/normalized -w 8 -v

Export sanitized dataset:

phantom-dag -i ./internal -o ./export --sanitize pii

Audit a repo before committing:

phantom-scan ./project | jq '.findings[] | select(.risk_score > 0.7)'

Verify a data transfer:

phantom-hash ./original > before.json
cp -r ./original ./destination
phantom-hash ./destination > after.json
diff <(jq -S . before.json) <(jq -S . after.json)

Ask questions about your documents:

just serve &
curl -X POST http://localhost:8000/vectors/index -F "file=@docs.pdf"
curl -X POST http://localhost:8000/api/chat \
  -H "Content-Type: application/json" \
  -d '{"message": "Summarize the main risks", "conversation_id": "s1", "history": []}'

Security​

Phantom runs a full security stack on every commit:

  • SAST: CodeQL (Python + JavaScript), Bandit
  • Dependency audit: pip-audit, safety, cargo-audit
  • Secret scanning: Trufflehog, detect-secrets
  • SBOM: CycloneDX, SPDX JSON, Syft (generated on every build)
  • Vulnerability scan: Grype against SBOM
  • Supply chain: OpenSSF Scorecard

Found a vulnerability? See SECURITY.md.


Development​

nix develop    # enter the pinned shell
just lint      # ruff + mypy
just fmt       # ruff format
just quality   # lint + typecheck + security scan
just ci        # lint + test (what CI runs)
just stats     # project statistics
just info      # environment summary

All tasks live in the justfile. Run just with no arguments to list them.

Pre-commit hooks are installed automatically when you enter nix develop. They run ruff, mypy, and bandit before every commit.


Status​

| Component | Status |
| --- | --- |
| CORTEX Engine | Production ready |
| FAISS Vector Store + Hybrid Search | Production ready |
| FastAPI Server (20 endpoints) | Production ready |
| DAG Pipeline + Sanitization | Production ready |
| Prometheus Metrics + Structlog | Production ready |
| CI/CD (7 workflows) | Production ready |
| IntelAgent Rust Workspace | Production ready |
| Cortex Desktop (Tauri + SvelteKit) | Beta |
| CLI Commands | Complete |
| Cloud LLM Providers | Planned (Q2 2026) |
| Redis Semantic Cache | Planned (Q2 2026) |
| Kubernetes Helm Charts | Planned (Q3 2026) |

License​

Apache 2.0. See LICENSE.


Contributing​

Read CONTRIBUTING.md before opening a PR.

For architecture changes or significant API modifications, open an issue first. The docs/adr/ directory has the decision history. Read it before proposing something we already debated and rejected.

Contributions welcome. Hot takes about the architecture go in the issues. Fixes go in PRs.