
ML Offload API

Unified Multi-Backend ML Model Orchestration & Inference Platform

![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg) ![Built with Nix](https://img.shields.io/badge/Built_With-Nix-5277C3.svg?logo=nixos&logoColor=white) ![Rust](https://img.shields.io/badge/Rust-1.70+-orange.svg?logo=rust) ![Status: Active Development](https://img.shields.io/badge/Status-Active_Development-green.svg)

ML Offload API is a high-performance orchestration layer that provides unified access to multiple ML inference backends (Ollama, llama.cpp, vLLM, TGI). It manages VRAM, selects a backend automatically, and is tuned for NVIDIA Blackwell (B200/B300) GPUs.


🏗️ Architecture

graph TD
    subgraph "Platform Services"
        NL[neoland] --> MOA[ml-ops-api]
        CR[cerebro] --> MOA
        AA[arch-analyzer] --> MOA
    end

    subgraph "ml-ops-api Core"
        MOA --> TF[tensorforge / Rust Core]
        TF --> B[Backends]
        B --> LC[llama.cpp]
        B --> VL[vLLM]
        TF --> RM[VRAM Monitoring / NVML]
        TF --> MR[Model Registry / SQLite]
    end

    subgraph "Analysis Layer"
        AA --> PY[Python Analyzer]
        PY --> TF
    end

    subgraph "Infrastructure"
        TF --> NX[NixOS / B200 Optimized]
        TF --> AZ[Azure Fleet]
    end

🚀 Key Features

  • 🎯 Multi-Backend Orchestration - Unified API for Ollama, llama.cpp, vLLM, and TGI.
  • 🧠 Intelligent Routing - Automatic backend selection based on VRAM availability, load, and model requirements.
  • 📊 Real-time Monitoring - GPU memory tracking via NVIDIA NVML for precise resource allocation.
  • 🏎️ Blackwell Optimized - Specialized NixOS modules for NVIDIA B200 (192GB HBM3e) with FP8 quantization and tensor parallelism.
  • 🔌 WebSocket & Streaming - Real-time streaming inference for low-latency applications.
  • 📦 Nix-First Workflow - Reproducible builds, development shells, and declarative deployment via Nix Flakes.
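
The routing rule described above (choose a backend by VRAM availability, load, and model requirements) can be sketched roughly as follows. This is a minimal illustration in Python, not the platform's Rust implementation: `BackendStatus`, `select_backend`, and all field names are hypothetical, and the tie-breaking policy is an assumption.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class BackendStatus:
    """Hypothetical snapshot of one backend; field names are illustrative."""
    name: str
    free_vram_mb: int      # free GPU memory, e.g. as reported by NVML
    active_requests: int   # requests currently in flight
    max_requests: int      # concurrent slots the backend offers


def select_backend(
    backends: Sequence[BackendStatus], required_vram_mb: int
) -> Optional[BackendStatus]:
    """Return the least-loaded backend that can fit the model, or None."""
    candidates = [
        b for b in backends
        if b.free_vram_mb >= required_vram_mb
        and b.active_requests < b.max_requests
    ]
    if not candidates:
        return None
    # Assumed policy: lowest relative load first, then most free VRAM.
    return min(
        candidates,
        key=lambda b: (b.active_requests / b.max_requests, -b.free_vram_mb),
    )
```

Returning `None` instead of raising leaves the caller free to queue the request or reject it when no backend has enough free memory.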

🛠️ Components

1. TensorForge (Rust Core)

The heart of the platform, built with Axum and Tokio. It defines a unified Backend trait and manages the lifecycle of inference processes.
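
The actual `Backend` trait is written in Rust and is not reproduced here. As a language-neutral illustration of the idea (one interface, many backends, explicit lifecycle), here is a small Python sketch. The method names `start`, `stop`, `is_running`, and `generate`, the `EchoBackend` stand-in, and `run_once` are all assumptions made for this example.

```python
from abc import ABC, abstractmethod


class Backend(ABC):
    """Illustrative interface only; the real trait lives in the Rust core."""

    @abstractmethod
    def start(self) -> None: ...

    @abstractmethod
    def stop(self) -> None: ...

    @abstractmethod
    def is_running(self) -> bool: ...

    @abstractmethod
    def generate(self, prompt: str) -> str: ...


class EchoBackend(Backend):
    """Stand-in backend that needs no GPU, used to exercise the interface."""

    def __init__(self) -> None:
        self._running = False

    def start(self) -> None:
        self._running = True

    def stop(self) -> None:
        self._running = False

    def is_running(self) -> bool:
        return self._running

    def generate(self, prompt: str) -> str:
        if not self._running:
            raise RuntimeError("backend is not running")
        return f"echo: {prompt}"


def run_once(backend: Backend, prompt: str) -> str:
    """Start the backend, serve one request, and always stop it afterwards."""
    backend.start()
    try:
        return backend.generate(prompt)
    finally:
        backend.stop()
```

Because callers depend only on `Backend`, a llama.cpp or vLLM wrapper could be swapped in without changing `run_once`.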

2. Arch-Analyzer (Python)

An integrated code analysis tool that uses the platform's LLM inference to analyze the structure of codebases across multiple languages (Rust, Python, Nix, TypeScript).


📋 Inference Pipeline

The platform provides a suite of scripts for the full "bootstrap, pull, run, export" lifecycle:

| Script | Role |
| --- | --- |
| bootstrap.sh | Build llama.cpp with CUDA, auto-detect GPU arch (sm_XX). |
| server.sh | Manage server lifecycle (start/stop/restart/status). |
| model-pull.sh | Download GGUF models from HuggingFace. |
| run.sh | Full pipeline: auto-start server → infer → export → auto-stop. |
| export.sh | Convert JSON results to MD, Text, JSONL, or CSV. |
| health.sh | GPU + server status + latency benchmarks. |
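
To make the export step concrete, here is a rough Python sketch of converting JSON results to JSONL and CSV. It is not the logic of `export.sh`. The record shape (a list of objects with `prompt` and `response` keys) is an assumption, since the result schema is not documented here, and `to_jsonl` and `to_csv` are hypothetical helper names.

```python
import csv
import io
import json
from typing import Iterable, Mapping, Sequence


def to_jsonl(records: Iterable[Mapping[str, object]]) -> str:
    """Serialize each record as one JSON object per line."""
    return "".join(json.dumps(r, ensure_ascii=False) + "\n" for r in records)


def to_csv(records: Iterable[Mapping[str, object]], fields: Sequence[str]) -> str:
    """Write the chosen fields as CSV; keys not listed in `fields` are dropped."""
    buf = io.StringIO()
    writer = csv.DictWriter(
        buf, fieldnames=list(fields), extrasaction="ignore", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(records)
    return buf.getvalue()


if __name__ == "__main__":
    # Assumed result shape, for illustration only.
    results = [{"prompt": "Explain quantum computing", "response": "..."}]
    print(to_jsonl(results), end="")
    print(to_csv(results, ["prompt", "response"]), end="")
```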

🚀 Quick Start

# Enter development shell
nix develop

# Build the project
nix build

# Run the API
nix run

Bare Metal / Pipeline

# 1. Build backend (llama.cpp)
./tensorforge/scripts/bootstrap.sh

# 2. Pull a model
./tensorforge/scripts/model-pull.sh --model qwen2.5-coder-32b

# 3. Run a one-shot inference pipeline
./tensorforge/scripts/run.sh --prompt "Explain quantum computing" --output result.json
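
Once a llama.cpp server is running, it can also be queried directly over HTTP. The sketch below assumes the server exposes llama.cpp's `/completion` endpoint at the default `LLAMACPP_URL`. It is not a description of what `run.sh` does internally, and the helper names are made up for this example.

```python
import json
import urllib.request


def build_completion_request(
    base_url: str, prompt: str, n_predict: int = 128
) -> urllib.request.Request:
    """Build a POST request for the llama.cpp server's /completion endpoint."""
    payload = json.dumps({"prompt": prompt, "n_predict": n_predict}).encode("utf-8")
    return urllib.request.Request(
        base_url.rstrip("/") + "/completion",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(base_url: str, prompt: str, timeout: float = 60.0) -> str:
    """Send the request and return the generated text (needs a live server)."""
    request = build_completion_request(base_url, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.loads(response.read().decode("utf-8"))
    # The server is assumed to return the text under the "content" key.
    return body.get("content", "")


if __name__ == "__main__":
    print(complete("http://127.0.0.1:8080", "Explain quantum computing"))
```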

Docker

# Start infrastructure (PostgreSQL & MLflow)
docker-compose up -d

⚙️ Configuration

| Variable | Default | Description |
| --- | --- | --- |
| LLAMACPP_URL | http://127.0.0.1:8080 | llama.cpp server endpoint |
| LLM_PARALLEL | 8 | Concurrent inference slots |
| TF_MODELS | /var/lib/tensorforge/models/gguf | Model storage path |
| HF_TOKEN | (Required for private) | HuggingFace Access Token |
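
A client or helper script might read these variables with the same defaults. The following Python sketch shows one way to do that. `Settings` and `load_settings` are hypothetical names, and treating an empty `HF_TOKEN` as unset is an assumption.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    llamacpp_url: str
    llm_parallel: int
    tf_models: str
    hf_token: Optional[str]


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read the documented variables, falling back to the documented defaults."""
    parallel = int(env.get("LLM_PARALLEL", "8"))
    if parallel < 1:
        raise ValueError("LLM_PARALLEL must be a positive integer")
    return Settings(
        llamacpp_url=env.get("LLAMACPP_URL", "http://127.0.0.1:8080").rstrip("/"),
        llm_parallel=parallel,
        tf_models=env.get("TF_MODELS", "/var/lib/tensorforge/models/gguf"),
        # Assumption: an empty token is treated the same as no token.
        hf_token=env.get("HF_TOKEN") or None,
    )
```

Passing the environment as a mapping keeps the function easy to test without modifying the real process environment.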

💎 B200 Blackwell Optimization

Declarative configuration via NixOS module:

services.tensorforge.b200Optimized = {
  enable = true;
  vllm.enable = true;       # tensor-parallel-size=4
  llamacpp.enable = true;   # --flash-attn, ctx=32768
  monitoring.enable = true; # Prometheus + Grafana
};

Maintained by: VoidNxSEC Team
GPU Program: NVIDIA Inception (B200/B300 fleet)