ML Offload API
Unified Multi-Backend ML Model Orchestration & Inference Platform
   
ML Offload API is a high-performance orchestration layer designed to provide unified access to multiple ML inference backends (Ollama, llama.cpp, vLLM, TGI). It features intelligent VRAM management, automatic backend selection, and deep optimization for NVIDIA Blackwell (B200/B300) architectures.
ποΈ Architectureβ
graph TD
subgraph "Platform Services"
NL[neoland] --> MOA[ml-ops-api]
CR[cerebro] --> MOA
AA[arch-analyzer] --> MOA
end
subgraph "ml-ops-api Core"
MOA --> TF[tensorforge / Rust Core]
TF --> B[Backends]
B --> LC[llama.cpp]
B --> VL[vLLM]
TF --> RM[VRAM Monitoring / NVML]
TF --> MR[Model Registry / SQLite]
end
subgraph "Analysis Layer"
AA --> PY[Python Analyzer]
PY --> TF
end
subgraph "Infrastructure"
TF --> NX[NixOS / B200 Optimized]
TF --> AZ[Azure Fleet]
end
π Key Featuresβ
- π― Multi-Backend Orchestration - Unified API for Ollama, llama.cpp, vLLM, and TGI.
- π§ Intelligent Routing - Automatic backend selection based on VRAM availability, load, and model requirements.
- π Real-time Monitoring - GPU memory tracking via NVIDIA NVML for precise resource allocation.
- ποΈ Blackwell Optimized - Specialized NixOS modules for NVIDIA B200 (192GB HBM3e) with FP8 quantization and tensor parallelism.
- π WebSocket & Streaming - Real-time streaming inference for low-latency applications.
- π¦ Nix-First Workflow - Reproducible builds, development shells, and declarative deployment via Nix Flakes.
π οΈ Componentsβ
1. TensorForge (Rust Core)β
The heart of the platform, built with Axum and Tokio. It defines a unified Backend trait and manages the lifecycle of inference processes.
2. Arch-Analyzer (Python)β
An integrated code analysis tool that leverages the platform's LLM capabilities to perform deep structural analysis of codebases across multiple languages (Rust, Python, Nix, TypeScript).
π Inference Pipelineβ
The platform provides a suite of scripts for the full "bootstrap, pull, run, export" lifecycle:
| Script | Role |
|---|---|
bootstrap.sh | Build llama.cpp with CUDA, auto-detect GPU arch (sm_XX). |
server.sh | Manage server lifecycle (start/stop/restart/status). |
model-pull.sh | Download GGUF models from HuggingFace. |
run.sh | Full pipeline: auto-start server β infer β export β auto-stop. |
export.sh | Convert JSON results to MD, Text, JSONL, or CSV. |
health.sh | GPU + server status + latency benchmarks. |
π Quick Startβ
Nix (Recommended)β
# Enter development shell
nix develop
# Build the project
nix build
# Run the API
nix run
Bare Metal / Pipelineβ
# 1. Build backend (llama.cpp)
./tensorforge/scripts/bootstrap.sh
# 2. Pull a model
./tensorforge/scripts/model-pull.sh --model qwen2.5-coder-32b
# 3. Run a one-shot inference pipeline
./tensorforge/scripts/run.sh --prompt "Explain quantum computing" --output result.json
Dockerβ
# Start infrastructure (PostgreSQL & MLflow)
docker-compose up -d
βοΈ Configurationβ
| Variable | Default | Description |
|---|---|---|
LLAMACPP_URL | http://127.0.0.1:8080 | llama.cpp server endpoint |
LLM_PARALLEL | 8 | Concurrent inference slots |
TF_MODELS | /var/lib/tensorforge/models/gguf | Model storage path |
HF_TOKEN | (Required for private) | HuggingFace Access Token |
π B200 Blackwell Optimizationβ
Declarative configuration via NixOS module:
services.tensorforge.b200Optimized = {
enable = true;
vllm.enable = true; # tensor-parallel-size=4
llamacpp.enable = true; # --flash-attn, ctx=32768
monitoring.enable = true; # Prometheus + Grafana
};
Maintained by: VoidNxSEC TeamGPU Program: NVIDIA Inception (B200/B300 fleet)