ML-Ops API β Inference Platform
Status: Active Development
Date: 2026-04-07
GPU Target: NVIDIA B200 / B300 (Blackwell) via NVIDIA Inception
Overviewβ
ml-ops-api is the inference backbone of the voidnxlabs platform.
It routes LLM requests from all services (neoland, cerebro, arch-analyzer) to the
appropriate GPU backend, with full B200 optimization and batch inference support.
Architectureβ
voidnxlabs platform services
β
βΌ
ml-ops-api
βββ tensorforge/ β Rust core (BackendDriver, B200 NixOS module)
β βββ backends/
β β βββ llamacpp/ β llama.cpp Rust client (async, VRAM tracking)
β β βββ vllm/ β vLLM Rust client
β βββ core/ β unified API (axum + inference trait)
β βββ nix/b200-optimized/ β NixOS module for B200 fleet
β βββ scripts/ β Linux bootstrap + pipeline scripts
β
βββ arch-analyzer/ β AI code analysis (Python, platform-wide)
βββ src/analyzer.py β symlink β ~/master/arch-analyzer/src/analyzer.py
tensorforge/scripts β Inference Pipelineβ
The full "entrar, inferir, extrair, exportar, fechar" lifecycle in bash:
| Script | Role |
|---|---|
entrypoint.sh | Single entrypoint β orchestrates all commands: setup, run, analyze, server, health, status |
install-deps.sh | Install ALL apt packages + Python venv for llama.cpp build and arch-analyzer |
bootstrap.sh | Build llama.cpp from source with CUDA, auto-detect GPU arch (sm_XX), write env.sh |
server.sh | start / stop / restart / status / logs / gpu β B200 defaults: -ngl 999 --flash-attn --cont-batching |
model-pull.sh | Download GGUF models from HuggingFace registry; supports resume, HF token |
infer.sh | Single-shot inference: prompt / file / stream / JSON mode |
run.sh | Full pipeline: auto-start server β infer β export β auto-stop |
export.sh | Convert JSON results to text / markdown / jsonl / csv |
health.sh | GPU + server status + latency benchmark; --watch N for live monitoring |
Makefile | Shortcuts: make bootstrap, make run PROMPT="...", make health-watch |
Quick Start (NIM / bare Linux)β
# 1. Setup everything in one shot: deps + build + model
./tensorforge/scripts/entrypoint.sh setup --model qwen2.5-coder-32b
# 2. Full pipeline (auto-start β infer β display β auto-stop)
./tensorforge/scripts/entrypoint.sh run \
--prompt "analyze this Rust codebase for security issues" \
--output result.json
# 3. Export result to markdown
./tensorforge/scripts/export.sh --input result.json --format markdown
# 4. Run platform-wide code analysis (all ~/master/ projects)
./tensorforge/scripts/entrypoint.sh analyze --platform
Batch Inference β Fleet Patternβ
# Prepare prompt files
ls prompts/*.txt
# Start β batch all prompts β results in results/ β stop
./tensorforge/scripts/entrypoint.sh run \
--batch prompts/ \
--output-dir results/ \
--keep-alive # leave server running between jobs
# Export all to markdown
./tensorforge/scripts/export.sh --dir results/ --format markdown --out-dir reports/
# Teardown
./tensorforge/scripts/entrypoint.sh server stop
Manual step-by-step (full control)β
# 1. Install system packages
./tensorforge/scripts/install-deps.sh
# 2. Build llama.cpp with CUDA
./tensorforge/scripts/bootstrap.sh
# 3. Download model
./tensorforge/scripts/model-pull.sh --model qwen2.5-coder-32b
# 4. Start server
./tensorforge/scripts/server.sh start
# 5. Single inference
./tensorforge/scripts/infer.sh "explain this code" --output result.json
# 6. Export
./tensorforge/scripts/export.sh --input result.json --format markdown
# 7. Stop
./tensorforge/scripts/server.sh stop
Model Registryβ
| Name | VRAM | Best For |
|---|---|---|
qwen2.5-coder-7b | 8GB | Fast code analysis |
qwen2.5-coder-32b | 35GB | arch-analyzer default |
llama-3.1-70b | 75GB | General purpose, B200 |
llama-3.3-70b | 75GB | Best instruction following |
deepseek-r1-14b | 16GB | Reasoning, fits 24GB |
deepseek-r1-70b | 75GB | Strong reasoning, B200 |
mistral-24b | 26GB | Balanced |
arch-analyzer β Platform-Wide Code Analysisβ
Integrated into ml-ops-api, using the tensorforge llama.cpp backend as LLM.
Language Templatesβ
| Template | Files | Projects |
|---|---|---|
nix | *.nix | spider-nix, spooknix, nixos modules |
python | *.py (AST) | cerebro, phantom, neoland-agents, neotron |
rust | *.rs | neoland, securellm-bridge, adr-ledger, spectre, tensorforge |
typescript | *.ts/.tsx/.jsx | neoland-ui, portfolio |
auto | auto-detect | any project |
Platform Commandβ
# Full platform analysis via entrypoint
./tensorforge/scripts/entrypoint.sh analyze --platform
# With specific model
./tensorforge/scripts/entrypoint.sh analyze --platform --model llama-3.3-70b
# Single project
./tensorforge/scripts/entrypoint.sh analyze \
--project ~/master/neoland \
--template rust
# Via make (arch-analyzer/Makefile)
make -C arch-analyzer platform MODEL=llama-3.3-70b
make -C arch-analyzer analyze-project TARGET=~/master/neoland TEMPLATE=rust
Output: PLATFORM-REPORT.md + PLATFORM-REPORT.json with:
- Per-project quality scores, security issues, executive summaries
- Cross-project dependency graph (Cargo path deps, flake inputs)
- Platform-level priority actions from LLM
tensorforge β Rust Coreβ
B200 Optimizations (NixOS Module)β
nix/b200-optimized/default.nix β declarative config for NVIDIA B200 (192GB HBM3e):
services.tensorforge.b200Optimized = {
enable = true;
vllm.enable = true; # tensor-parallel-size=4, max-model-len=131072
llamacpp.enable = true; # -ngl 999, --flash-attn, ctx=32768
systemTuning.enable = true;
monitoring.enable = true; # Prometheus + Grafana
};
Key specs:
- VRAM: 192GB HBM3e
- Context: up to 131k tokens (vLLM), 65536 (llama.cpp on B200)
- Tensor parallelism: 4 (B200 4-die architecture)
- Quantization: FP8 (2x throughput)
Backend Traitβ
// tensorforge_core::Backend
async fn infer(request: InferenceRequest) -> TensorForgeResult<InferenceResult>
async fn load_model(model_id: &str, options: LoadOptions) -> TensorForgeResult<LoadResult>
async fn health_check() -> TensorForgeResult<BackendHealth>
async fn vram_usage() -> TensorForgeResult<VramUsage>
fn capabilities() -> BackendCapabilities
Integration with Platformβ
neoland control plane ββHTTPβββΊ ml-ops-api / tensorforge (port 8080)
arch-analyzer ββHTTPβββΊ LLAMACPP_URL=http://127.0.0.1:8080
cerebro RAG ββHTTPβββΊ ml-ops-api / tensorforge
Environment variables:
| Var | Default | Description |
|---|---|---|
LLAMACPP_URL | http://127.0.0.1:8080 | llama.cpp server endpoint |
LLM_PARALLEL | 8 | Concurrent inference slots |
LLM_TIMEOUT | 180 | Request timeout (seconds) |
TF_PREFIX | ~/.tensorforge/llamacpp (non-root) or /opt/tensorforge/llamacpp (root) | Install prefix |
TF_MODELS | $TF_PREFIX/models/gguf | GGUF model directory |
HF_TOKEN | β | HuggingFace token (private models) |
MLOCK | false | Lock model in RAM (set true on bare-metal B200) |
NO_MMAP | false | Disable mmap (set true on bare-metal B200) |
NIM container note:
MLOCKandNO_MMAPdefault tofalsebecause containers typically lackCAP_IPC_LOCK. Enable manually on bare-metal nodes for peak performance.
Azure Deploymentβ
azure-ml/azure-nix-template.sh provisions the VM fleet:
# Foundation (RG, VNet, NSG, Storage)
./azure-ml/azure-nix-template.sh foundation
# Deploy VM β nix binary cache (B2s ~$35/mo)
./azure-ml/azure-nix-template.sh deploy
# Stream VM β app hosting (D4s_v5 ~$140/mo)
./azure-ml/azure-nix-template.sh stream
# Status
./azure-ml/azure-nix-template.sh status
Budget reference (brazilsouth):
- B2s (2 vCPU/4GB): ~$35/mo β CI/nix-cache
- B4ms (4 vCPU/16GB) Spot: ~$20/mo β ephemeral test runs
- D4s_v5 (4 vCPU/16GB): ~$140/mo β stream/app hosting
Maintained by: VoidNxSEC Team
GPU Program: NVIDIA Inception (B200/B300 fleet)