README
Local-first AI document intelligence. No cloud required. No excuses.
Phantom is a production-grade document intelligence engine that classifies, sanitizes, and understands unstructured data: locally, privately, and fast.
It's not a wrapper around an API. It's not a demo. It runs entirely on your hardware, speaks to local LLMs via llama.cpp, indexes your documents into FAISS, and answers questions about them through a hybrid RAG pipeline. The only dependency is Nix.
What it does in one sentence: feed it documents, get back structured intelligence (themes, patterns, PII reports, vector search, and RAG-powered chat) without a single byte leaving your machine.
What's Inside
phantom/
├── CORTEX Engine - semantic chunking, LLM classification, VRAM-aware
├── RAG Pipeline - FAISS + BM25 hybrid search with RRF fusion
├── FastAPI Server - 20 endpoints, SSE streaming, Prometheus metrics
├── DAG Pipeline - file classification, PII detection, sanitization
├── IntelAgent (Rust) - 8-crate workspace: governance, security, memory, MCP
├── Cortex Desktop - Tauri 2 + SvelteKit GUI
└── CLI - Typer-based, scriptable, composable
Quickstart
You need Nix. That's it.
git clone https://github.com/kernelcore/phantom
cd phantom
# Drop into the fully-pinned dev environment
nix develop
# Run the test suite to confirm everything works
just test
# Start the API server
just serve
# Or run the full desktop app
just desktop
No pip install. No virtualenv. No "works on my machine." The environment is hermetic and reproducible: today, in six months, on any machine.
Core Capabilities
CORTEX Engine
The heart of Phantom. Processes raw documents into structured insights through a multi-stage pipeline:
Document → SemanticChunker → EmbeddingGenerator → LLM Classifier → Pydantic Schema
- Chunking with configurable token budgets (default: 1024 tokens, 128 overlap)
- Parallel LLM calls with retry logic (3 attempts, 2s backoff)
- Real-time VRAM monitoring with auto-throttle, so it won't OOM your GPU
- Extracts: Theme, Pattern, Learning, Concept, Recommendation
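The token budget and overlap above can be pictured as a sliding window. This is a minimal sketch, not Phantom's SemanticChunker: chunk_tokens is a hypothetical helper, and it assumes the text has already been tokenized into a flat list (a real chunker would also respect semantic boundaries).

```python
from typing import List


def chunk_tokens(tokens: List[str], chunk_size: int = 1024, overlap: int = 128) -> List[List[str]]:
    """Split tokens into windows of at most chunk_size tokens.

    Each window repeats the last `overlap` tokens of the previous one,
    so context is not lost at chunk boundaries.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap  # how far the window advances each time
    chunks: List[List[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reaches the end of the input
    return chunks
```

With the defaults, a 2000-token document yields three chunks, and the last 128 tokens of each chunk reappear at the start of the next.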
# Process a document directory
just run extract --input ./docs --output ./insights.json
# Or hit the API directly
curl -X POST http://localhost:8000/process \
-F "file=@report.pdf" \
-F "chunk_strategy=recursive" \
-F "chunk_size=1024"
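The retry policy listed above (3 attempts, 2s backoff) could look roughly like this. It is a sketch under assumptions: call_with_retry is a hypothetical helper, the delay is treated as fixed rather than exponential, and every exception is treated as retryable, none of which the README confirms.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def call_with_retry(
    fn: Callable[[], T],
    attempts: int = 3,
    backoff_s: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn until it succeeds or `attempts` calls have failed.

    Sleeps backoff_s seconds between failed attempts (fixed delay, an
    assumption) and re-raises the last error once attempts run out.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")

    last_error: Optional[Exception] = None
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except Exception as exc:  # a real client would catch narrower errors
            last_error = exc
            if attempt < attempts:
                sleep(backoff_s)
    assert last_error is not None
    raise last_error
```

The sleep parameter is injectable so the policy can be exercised in tests without actually waiting.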
Hybrid Vector Search
Most RAG systems pick either semantic or keyword search. Phantom does both and fuses the results using Reciprocal Rank Fusion:
Query → FAISS (dense cosine) ─┐
                              ├─ RRF Fusion → Ranked Results
Query → BM25Okapi (sparse) ───┘
- FAISS IndexFlatIP with L2-normalized cosine similarity
- Optional GPU acceleration via StandardGpuResources
- BM25 index rebuilt lazily on each add(), so no manual sync is required
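Why an inner-product index gives cosine similarity: once both vectors are scaled to unit length, their dot product equals the cosine of the angle between them. The sketch below shows that identity in plain Python. It does not call FAISS, and the helper names are illustrative.

```python
import math
from typing import List, Sequence


def l2_normalize(vec: Sequence[float]) -> List[float]:
    """Scale vec to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def inner_product(a: Sequence[float], b: Sequence[float]) -> float:
    """Plain dot product, the score an inner-product index ranks by."""
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Textbook cosine: dot product divided by the product of the norms."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return inner_product(a, b) / (norm_a * norm_b)
```

For any non-zero a and b, inner_product(l2_normalize(a), l2_normalize(b)) matches cosine_similarity(a, b) up to floating-point error.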
curl -X POST http://localhost:8000/vectors/search \
-H "Content-Type: application/json" \
-d '{"query": "compliance requirements", "top_k": 5, "search_type": "hybrid"}'
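Reciprocal Rank Fusion itself is small enough to sketch. Each document scores 1 / (k + rank) in every ranking that contains it, and the scores are summed. This is an illustration rather than Phantom's code: rrf_fuse is a hypothetical name, and k = 60 is the constant commonly used for RRF, not a value taken from this project.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def rrf_fuse(rankings: Sequence[Sequence[str]], k: int = 60) -> List[Tuple[str, float]]:
    """Fuse several best-first rankings of document ids into one.

    A document's fused score is the sum of 1 / (k + rank) over every
    ranking it appears in, with ranks starting at 1.
    """
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest score first; ties broken by id so the output is stable.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


dense_hits = ["a", "b", "c"]   # e.g. ids returned by the dense index
sparse_hits = ["b", "d", "a"]  # e.g. ids returned by BM25
fused = rrf_fuse([dense_hits, sparse_hits])
```

Documents that appear in both lists ("a" and "b") rise above documents found by only one retriever, which is the point of fusing by rank instead of by raw score.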
RAG Chat with Streaming
Context-aware chat over your document base. Supports SSE streaming for real-time token delivery.
# Streaming chat
curl -X POST http://localhost:8000/api/chat/stream \
-H "Content-Type: application/json" \
-d '{
"message": "What are the key risks in the Q3 report?",
"conversation_id": "session-001",
"history": [],
"context_size": 5
}'
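On the client side, an SSE response is a stream of text lines in which each event's payload sits in data: fields and a blank line ends the event. A minimal parser sketch follows. iter_sse_data is a hypothetical helper, and what Phantom puts inside each data payload (plain tokens or JSON) is not specified here, so the sketch yields raw strings.

```python
from typing import Iterable, Iterator, List


def iter_sse_data(lines: Iterable[str]) -> Iterator[str]:
    """Yield the data payload of each server-sent event.

    Follows the basic SSE framing rules: "data:" lines accumulate, a
    blank line dispatches the event, and lines starting with ":" are
    comments (often used as keep-alives).
    """
    buffer: List[str] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            if buffer:
                yield "\n".join(buffer)
                buffer = []
            continue
        if line.startswith(":"):
            continue  # comment line, ignored
        field, _, value = line.partition(":")
        if field == "data":
            # A single leading space after the colon is not part of the value.
            buffer.append(value[1:] if value.startswith(" ") else value)
    # An event that was never terminated by a blank line is dropped,
    # which matches how SSE clients treat a truncated stream.
```

Any HTTP client that can iterate over response lines can feed this generator.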
The LLM provider is fully abstracted. Default is llama.cpp over HTTP (OpenAI-compatible). OpenAI, Anthropic, and DeepSeek slots exist in the provider registry; they're just not wired yet. That's intentional: we're not building cloud lock-in.
Data Sanitization Pipeline
Phantom's DAG pipeline processes files through a classification and sanitization chain before they ever touch your vector store:
Discovery → Fingerprint → Classify → Pseudonymize → Sanitize → Verify → Persist
Four sanitization levels:
| Level | What happens |
|---|---|
| none | Direct copy, no modifications |
| strip_metadata | EXIF, document properties, author fields removed |
| redact_pii | Email, phone, SSN, CPF/CNPJ, credit cards replaced with [REDACTED] |
| full_sanitize | Everything above + content normalization |
PII detection covers: email addresses, phone numbers, SSN, CPF/CNPJ, payment card numbers, AWS credentials, API keys, Bearer tokens, private keys, PGP blocks, IPv4/IPv6 ranges, UUIDs.
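To make the redact_pii level concrete, here is a minimal sketch of regex-based redaction. The three patterns are simplified illustrations, not Phantom's actual detectors, and redact_pii as a Python function is a hypothetical stand-in for the pipeline stage.

```python
import re
from typing import Dict, Tuple

# Deliberately simple patterns for illustration only; production
# detectors need stricter validation (checksums, context, etc.).
PII_PATTERNS = {
    "email": re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "cpf": re.compile(r"\b\d{3}\.\d{3}\.\d{3}-\d{2}\b"),
}


def redact_pii(text: str) -> Tuple[str, Dict[str, int]]:
    """Replace every match with [REDACTED] and count hits per label."""
    counts: Dict[str, int] = {}
    for label, pattern in PII_PATTERNS.items():
        text, hits = pattern.subn("[REDACTED]", text)
        counts[label] = hits
    return text, counts
```

Returning the per-label counts alongside the cleaned text makes it easy to report how much sensitive data a file contained without keeping the values themselves.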
# Scan a directory for sensitive content
phantom-scan ./repo | jq '.findings[] | select(.risk_score > 0.7)'
# Sanitize before exporting
phantom-dag -i ./internal_dataset -o ./export --sanitize pii
# Dry-run to preview what would happen
phantom -i ./input -o ./output --dry-run
Cryptographic Integrity
Every file processed gets a hash. You choose the algorithm:
| Algorithm | Use case |
|---|---|
| SHA256 | Baseline integrity, broad compatibility |
| BLAKE3 | High-throughput, modern standard |
| xxHash | Maximum speed, block-level streaming |
# Generate a manifest
phantom-hash ./directory > manifest.json
# Verify a file against a known hash
phantom-verify report.pdf abc123def456...
# Diff two manifests (transfer verification)
diff <(jq -S . before.json) <(jq -S . after.json)
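The idea behind a hash manifest fits in a few lines of standard-library Python. This sketch uses SHA256 only (BLAKE3 and xxHash need third-party packages), and its output shape, a mapping of relative path to hex digest, is an assumption rather than the format phantom-hash actually writes.

```python
import hashlib
from pathlib import Path
from typing import Dict


def sha256_file(path: Path, block_size: int = 1 << 20) -> str:
    """Hash a file in blocks so large files never sit fully in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for block in iter(lambda: handle.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def build_manifest(root: Path) -> Dict[str, str]:
    """Map each file's path (relative to root) to its SHA256 hex digest."""
    return {
        path.relative_to(root).as_posix(): sha256_file(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def verify_file(path: Path, expected_hex: str) -> bool:
    """Check a file against a known digest, ignoring hex letter case."""
    return sha256_file(path) == expected_hex.lower()
```

Keying the manifest by relative path is what makes two manifests comparable after a copy: the roots differ, the relative paths and digests should not.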
API Reference
The FastAPI server runs at http://localhost:8000 by default. Prometheus metrics at /metrics, OpenAPI docs at /docs.
| Endpoint | Method | Purpose |
|---|---|---|
| /health | GET | Liveness probe |
| /ready | GET | Readiness check with downstream deps |
| /metrics | GET | Prometheus metrics |
| /api/system/metrics | GET | CPU, RAM, VRAM, disk |
| /process | POST | Process document with CORTEX |
| /extract | POST | Extract insights from text |
| /upload | POST | Single file upload |
| /api/upload | POST | Multi-file upload with processing |
| /vectors/search | POST | Hybrid vector search |
| /vectors/index | POST | Index document to FAISS |
| /vectors/batch-index | POST | Batch indexing |
| /api/chat | POST | RAG-powered chat |
| /api/chat/stream | POST | SSE streaming chat |
| /api/models | GET | List available LLM models |
| /api/prompt/test | POST | Render and token-count a prompt |
| /api/pipeline | POST | Full DAG pipeline execution |
| /api/pipeline/scan | POST | Scan-only (read-only, no writes) |
| /judge | POST | AI-Agent-OS judgment integration |
All request/response bodies are validated by Pydantic v2. No silent failures.
Output Structure
output/
├── documents/      # PDF, DOCX, TXT, MD
├── images/         # PNG, JPG, SVG
├── audio/          # MP3, FLAC, WAV
├── video/          # MP4, MKV, AVI
├── code/           # PY, JS, RS, GO, NIX
├── data/           # JSON, CSV, PARQUET
├── archives/       # ZIP, TAR, 7Z
├── configs/        # ENV, CONF, INI
├── logs/           # LOG, OUT, ERR
├── crypto/         # PEM, KEY, P12
├── executables/    # ELF, EXE, DEB
├── unknown/        # Unclassified
└── .phantom/
    ├── phantom.db            # SQLite audit log
    ├── pseudonym_map.json    # Reversible path mapping
    ├── reports/              # JSON execution reports
    ├── audit/                # Chain of custody
    ├── staging/              # Processing scratch space
    └── quarantine/           # Files that failed validation
Execution Report
{
"phantom_version": "0.0.1",
"statistics": {
"total_files": 15420,
"processed": 15398,
"failed": 22,
"success_rate": "99.86%",
"total_size_human": "48.32 GB",
"duration_seconds": "127.45",
"throughput_files_per_sec": "120.81",
"files_with_sensitive_data": 847
},
"sensitivity_breakdown": {
"PUBLIC": 12453,
"INTERNAL": 1892,
"CONFIDENTIAL": 734,
"SECRET": 289,
"TOP_SECRET": 30
}
}
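The figures in a report like this can be cross-checked mechanically. The sketch below recomputes the success rate and tests two relationships that hold in the sample above: processed plus failed equals the total, and the sensitivity breakdown sums to the processed count. Treat the second as an observation about this sample, not a documented guarantee; check_report is a hypothetical helper.

```python
from typing import Dict, List


def success_rate(processed: int, total: int) -> str:
    """Format processed / total as a percentage with two decimals."""
    if total <= 0:
        raise ValueError("total must be positive")
    return f"{processed / total * 100:.2f}%"


def check_report(statistics: Dict[str, object], breakdown: Dict[str, int]) -> List[str]:
    """Return human-readable inconsistencies; an empty list means none found."""
    problems: List[str] = []
    total = int(statistics["total_files"])
    processed = int(statistics["processed"])
    failed = int(statistics["failed"])
    if processed + failed != total:
        problems.append("processed + failed does not equal total_files")
    if sum(breakdown.values()) != processed:
        problems.append("sensitivity_breakdown does not sum to processed")
    if statistics["success_rate"] != success_rate(processed, total):
        problems.append("success_rate does not match processed / total_files")
    return problems


sample_statistics = {
    "total_files": 15420,
    "processed": 15398,
    "failed": 22,
    "success_rate": "99.86%",
}
sample_breakdown = {
    "PUBLIC": 12453,
    "INTERNAL": 1892,
    "CONFIDENTIAL": 734,
    "SECRET": 289,
    "TOP_SECRET": 30,
}
```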
Path Pseudonymization
Original paths are replaced with reversible pseudonyms built around a deterministic hash of the path. Nothing is lost: the mapping is persisted in pseudonym_map.json.
/home/user/docs/secret_report_2024.pdf
        ↓
PH-a1b2c3d4-e5f6a7b8-1234abcd.pdf
│  │        │        │
│  │        │        └─ Hexadecimal timestamp
│  │        └─ Random entropy block
│  └─ Deterministic path hash
└─ Namespace prefix
# Resolve it back
phantom --resolve PH-a1b2c3d4-e5f6a7b8-1234abcd.pdf
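A pseudonym with this layout could be generated as sketched below. The field derivations are assumptions made for illustration (SHA256 of the path truncated to 8 hex characters, 4 random bytes, Unix seconds in hex); make_pseudonym and PseudonymMap are hypothetical names, not Phantom's API.

```python
import hashlib
import secrets
import time
from pathlib import PurePosixPath
from typing import Dict, Optional


def make_pseudonym(path: str, now: Optional[float] = None) -> str:
    """Build PH-<path hash>-<entropy>-<hex timestamp><original extension>."""
    path_hash = hashlib.sha256(path.encode("utf-8")).hexdigest()[:8]  # deterministic
    entropy = secrets.token_hex(4)                                    # 8 random hex chars
    seconds = int(time.time() if now is None else now)
    stamp = format(seconds & 0xFFFFFFFF, "08x")                       # 8 hex chars
    return f"PH-{path_hash}-{entropy}-{stamp}{PurePosixPath(path).suffix}"


class PseudonymMap:
    """In-memory stand-in for the persisted mapping file."""

    def __init__(self) -> None:
        self._by_path: Dict[str, str] = {}
        self._by_pseudonym: Dict[str, str] = {}

    def assign(self, path: str) -> str:
        """Return the pseudonym for path, creating it on first use."""
        if path not in self._by_path:
            pseudonym = make_pseudonym(path)
            self._by_path[path] = pseudonym
            self._by_pseudonym[pseudonym] = path
        return self._by_path[path]

    def resolve(self, pseudonym: str) -> str:
        """Look up the original path; raises KeyError if unknown."""
        return self._by_pseudonym[pseudonym]
```

Because of the random block, a pseudonym cannot be recomputed from the path alone. Reversal depends on the stored mapping, which is why that file has to be kept.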
IntelAgent (Rust)
A separate Rust workspace living inside Phantom's repo. Eight crates, each with a defined responsibility:
| Crate | Purpose |
|---|---|
| intelagent-core | Shared types, traits, runtime primitives |
| intelagent-mcp | MCP protocol implementation (agent-to-agent comms) |
| intelagent-memory | Context windows, knowledge graphs |
| intelagent-quality | Automated peer review gates |
| intelagent-security | Privacy auditing, ed25519 signing, blake3 hashing |
| intelagent-governance | DAO-style rules, reward mechanisms |
| intelagent-cli | Command-line interface for the agent |
| intelagent-soc | GTK4 Security Operations Center UI |
Built with opt-level=3 + LTO + strip=true. Production binaries, not dev toys.
# Build all Rust crates
nix build .#intelagent
# Run Rust tests (nextest parallel runner)
nix flake check
Testing
Three levels. No compromises.
# Everything
just test
# Targeted
just test-unit
just test-integration
just test-e2e
# With coverage report (enforced minimum: 70%)
just test-cov
# GPU-specific tests
just test-gpu
# Match a pattern
just test-match "test_vector"
tests/
├── conftest.py        # Shared fixtures
├── test_imports.py    # Critical import smoke tests
├── unit/              # 17 test modules (isolated, fast)
├── integration/       # API + CLI tests (requires running server)
└── e2e/               # Full pipeline tests (slow, thorough)
Coverage is enforced at a 70% minimum via pytest --cov-fail-under=70 (a pytest-cov option). CI fails before you can merge anything that regresses it.
Extending Phantom
Add a sensitivity pattern
# src/phantom/pipeline/phantom_dag.py
SENSITIVE_PATTERNS = [
# (regex, label, risk_score)
(r'your_pattern', 'YOUR_LABEL', 0.9),
]
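To show how (regex, label, risk_score) tuples like these might be consumed, here is a hedged sketch of a scanner. scan_text is hypothetical, the two patterns and their scores are examples rather than Phantom's built-in list, and the finding fields are assumed apart from risk_score, which the jq filters in this README rely on.

```python
import re
from typing import Dict, List, Sequence, Tuple

# Example entries only: (regex, label, risk_score)
EXAMPLE_PATTERNS: List[Tuple[str, str, float]] = [
    (r"AKIA[0-9A-Z]{16}", "AWS_ACCESS_KEY_ID", 0.9),
    (r"-----BEGIN [A-Z ]*PRIVATE KEY-----", "PRIVATE_KEY", 1.0),
]


def scan_text(
    text: str,
    patterns: Sequence[Tuple[str, str, float]] = EXAMPLE_PATTERNS,
) -> List[Dict[str, object]]:
    """Return one finding per regex match, ordered by position in the text."""
    findings: List[Dict[str, object]] = []
    for regex, label, risk_score in patterns:
        for match in re.finditer(regex, text):
            findings.append(
                {
                    "label": label,
                    "risk_score": risk_score,
                    "start": match.start(),
                    "end": match.end(),
                }
            )
    return sorted(findings, key=lambda finding: finding["start"])
```

Findings record offsets instead of the matched text, so a scan report does not itself become a copy of the secret.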
Add a file type
EXT_MAP = {
'.yourext': Classification.DOCUMENTS,
}
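For orientation, an extension lookup of this kind can be sketched as follows. The Classification members and the classify helper here are stand-ins written for the example. The real enum lives in Phantom's source, and the real pipeline also inspects magic bytes, which this sketch leaves out.

```python
from enum import Enum
from pathlib import PurePath
from typing import Dict


class Classification(Enum):
    """Stand-in enum; member names mirror the output directories."""
    DOCUMENTS = "documents"
    CODE = "code"
    UNKNOWN = "unknown"


EXT_MAP: Dict[str, Classification] = {
    ".pdf": Classification.DOCUMENTS,
    ".md": Classification.DOCUMENTS,
    ".py": Classification.CODE,
    ".rs": Classification.CODE,
}


def classify(path: str) -> Classification:
    """Classify by lowercased file extension, falling back to UNKNOWN."""
    return EXT_MAP.get(PurePath(path).suffix.lower(), Classification.UNKNOWN)
```

Lowercasing the suffix means Report.PDF and report.pdf land in the same bucket.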
Add an LLM provider
Implement AIProvider from src/phantom/providers/base.py:
class YourProvider(AIProvider):
async def generate(self, prompt: str, **kwargs) -> GenerationResult: ...
async def stream(self, prompt: str, **kwargs) -> AsyncIterator[str]: ...
async def health_check(self) -> ProviderStatus: ...
Register it in the API's provider resolver. Done.
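A toy provider makes the shape of that interface easier to see. Everything below is a self-contained sketch: AIProvider, GenerationResult, and ProviderStatus are redefined as minimal stand-ins with assumed fields, so the block runs without Phantom installed. EchoProvider is hypothetical and exists only for illustration.

```python
import asyncio
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import AsyncIterator, List


@dataclass
class GenerationResult:  # stand-in; the real fields may differ
    text: str


@dataclass
class ProviderStatus:  # stand-in; the real fields may differ
    healthy: bool
    detail: str = ""


class AIProvider(ABC):  # stand-in for src/phantom/providers/base.py
    @abstractmethod
    async def generate(self, prompt: str, **kwargs) -> GenerationResult: ...

    @abstractmethod
    def stream(self, prompt: str, **kwargs) -> AsyncIterator[str]: ...

    @abstractmethod
    async def health_check(self) -> ProviderStatus: ...


class EchoProvider(AIProvider):
    """Returns the prompt itself; handy for wiring tests without a model."""

    async def generate(self, prompt: str, **kwargs) -> GenerationResult:
        return GenerationResult(text=prompt.upper())

    async def stream(self, prompt: str, **kwargs) -> AsyncIterator[str]:
        # An async generator: callers consume it with `async for`.
        for word in prompt.split():
            yield word

    async def health_check(self) -> ProviderStatus:
        return ProviderStatus(healthy=True, detail="echo provider is always up")


async def collect_stream(provider: AIProvider, prompt: str) -> List[str]:
    """Drain a provider's stream into a list of tokens."""
    return [token async for token in provider.stream(prompt)]
```

A provider like this is useful as a first registration target: it exercises the resolver, streaming path, and health checks before any real model is involved.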
Known Constraints
- Files over 10 MB are skipped during deep PII scanning (magic bytes + extension classification still applies).
- Encrypted file content cannot be classified beyond magic bytes and extension.
- Metadata stripping is best-effort on proprietary formats; some residual metadata may survive.
These are documented tradeoffs, not bugs.
Common Workflows
Normalize legacy storage:
phantom -i /mnt/legacy -o /mnt/normalized -w 8 -v
Export sanitized dataset:
phantom-dag -i ./internal -o ./export --sanitize pii
Audit a repo before committing:
phantom-scan ./project | jq '.findings[] | select(.risk_score > 0.7)'
Verify a data transfer:
phantom-hash ./original > before.json
cp -r ./original ./destination
phantom-hash ./destination > after.json
diff <(jq -S . before.json) <(jq -S . after.json)
Ask questions about your documents:
just serve &
curl -X POST http://localhost:8000/vectors/index -F "file=@docs.pdf"
curl -X POST http://localhost:8000/api/chat \
-H "Content-Type: application/json" \
-d '{"message": "Summarize the main risks", "conversation_id": "s1", "history": []}'
Security
Phantom runs a full security stack on every commit:
- SAST: CodeQL (Python + JavaScript), Bandit
- Dependency audit: pip-audit, safety, cargo-audit
- Secret scanning: Trufflehog, detect-secrets
- SBOM: CycloneDX, SPDX JSON, Syft, generated on every build
- Vulnerability scan: Grype against SBOM
- Supply chain: OpenSSF Scorecard
Found a vulnerability? See SECURITY.md.
Development
nix develop # enter the pinned shell
just lint # ruff + mypy
just fmt # ruff format
just quality # lint + typecheck + security scan
just ci # lint + test (what CI runs)
just stats # project statistics
just info # environment summary
All tasks live in the justfile. Run just with no arguments to list them.
Pre-commit hooks are installed automatically when you enter nix develop. They run ruff, mypy, and bandit before every commit.
Status
| Component | Status |
|---|---|
| CORTEX Engine | Production ready |
| FAISS Vector Store + Hybrid Search | Production ready |
| FastAPI Server (20 endpoints) | Production ready |
| DAG Pipeline + Sanitization | Production ready |
| Prometheus Metrics + Structlog | Production ready |
| CI/CD (7 workflows) | Production ready |
| IntelAgent Rust Workspace | Production ready |
| Cortex Desktop (Tauri + SvelteKit) | Beta |
| CLI Commands | Complete |
| Cloud LLM Providers | Planned (Q2 2026) |
| Redis Semantic Cache | Planned (Q2 2026) |
| Kubernetes Helm Charts | Planned (Q3 2026) |
License
Apache 2.0. See LICENSE.
Contributing
Read CONTRIBUTING.md before opening a PR.
For architecture changes or significant API modifications, open an issue first. The docs/adr/ directory has the decision history; read it before proposing something we already debated and rejected.
Contributions welcome. Hot takes about the architecture go in the issues. Fixes go in PRs.