ML Offload API

Unified Multi-Backend ML Model Orchestration & Inference Platform

![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg) ![Built with Nix](https://img.shields.io/badge/Built_With-Nix-5277C3.svg?logo=nixos&logoColor=white) ![Rust](https://img.shields.io/badge/Rust-1.70+-orange.svg?logo=rust) ![Status: Active Development](https://img.shields.io/badge/Status-Active_Development-green.svg)

ML Offload API is an orchestration layer that provides unified access to multiple ML inference backends (Ollama, llama.cpp, vLLM, TGI). It tracks GPU memory through NVML, selects a backend automatically based on VRAM availability, load, and model requirements, and ships NixOS modules tuned for NVIDIA Blackwell (B200/B300) GPUs.

🏗️ Architecture

graph TD
    subgraph "Platform Services"
        NL[neoland] --> MOA[ml-ops-api]
        CR[cerebro] --> MOA
        AA[arch-analyzer] --> MOA
    end

    subgraph "ml-ops-api Core"
        MOA --> TF[tensorforge / Rust Core]
        TF --> B[Backends]
        B --> LC[llama.cpp]
        B --> VL[vLLM]
        TF --> RM[VRAM Monitoring / NVML]
        TF --> MR[Model Registry / SQLite]
    end

    subgraph "Analysis Layer"
        AA --> PY[Python Analyzer]
        PY --> TF
    end

    subgraph "Infrastructure"
        TF --> NX[NixOS / B200 Optimized]
        TF --> AZ[Azure Fleet]
    end

🚀 Key Features

🎯 Multi-Backend Orchestration - Unified API for Ollama, llama.cpp, vLLM, and TGI.
🧠 Intelligent Routing - Automatic backend selection based on VRAM availability, load, and model requirements.
📊 Real-time Monitoring - GPU memory tracking via NVIDIA NVML for precise resource allocation.
🏎️ Blackwell Optimized - Specialized NixOS modules for NVIDIA B200 (192GB HBM3e) with FP8 quantization and tensor parallelism.
🔌 WebSocket & Streaming - Real-time streaming inference for low-latency applications.
📦 Nix-First Workflow - Reproducible builds, development shells, and declarative deployment via Nix Flakes.
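
To make the routing feature concrete, here is a minimal sketch of one way to select a backend from VRAM and load figures. It is not TensorForge's actual algorithm: the `BackendStatus` fields, the `pick_backend` function, and the selection rule (filter by free VRAM and free slots, then prefer the lowest load) are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class BackendStatus:
    """Hypothetical snapshot of one backend; field names are illustrative."""
    name: str
    free_vram_mib: int    # free GPU memory available to this backend
    active_requests: int  # in-flight inference requests
    max_slots: int        # concurrent requests the backend can serve


def pick_backend(backends: list[BackendStatus], required_vram_mib: int) -> Optional[BackendStatus]:
    """Return the least-loaded backend that can fit the model, or None."""
    candidates = [
        b for b in backends
        if b.free_vram_mib >= required_vram_mib and b.active_requests < b.max_slots
    ]
    if not candidates:
        return None
    # Prefer the lowest load ratio; break ties by the most free VRAM.
    return min(candidates, key=lambda b: (b.active_requests / b.max_slots, -b.free_vram_mib))


if __name__ == "__main__":
    fleet = [
        BackendStatus("llama.cpp", free_vram_mib=40_000, active_requests=6, max_slots=8),
        BackendStatus("vllm", free_vram_mib=90_000, active_requests=2, max_slots=8),
        BackendStatus("ollama", free_vram_mib=8_000, active_requests=0, max_slots=4),
    ]
    choice = pick_backend(fleet, required_vram_mib=24_000)
    print(choice.name if choice else "no backend can host this model")  # prints "vllm"
```

In this example `ollama` is idle but lacks the VRAM, so the choice falls to `vllm`, which has both enough memory and the lower load ratio.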

🛠️ Components

1. TensorForge (Rust Core)

TensorForge is the core of the platform, built with Axum and Tokio. It defines a unified `Backend` trait and manages the lifecycle of inference processes.
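
The idea of one interface over several backends, with start and stop handled in one place, can be sketched as follows. The real trait is written in Rust and its methods are not shown in this README, so everything here is a hypothetical Python analogue: the `Backend` class, its `start`, `stop`, and `generate` methods, and the `EchoBackend` stand-in are all assumptions.

```python
from abc import ABC, abstractmethod


class Backend(ABC):
    """Hypothetical analogue of a unified backend interface."""

    name: str

    @abstractmethod
    def start(self) -> None: ...

    @abstractmethod
    def stop(self) -> None: ...

    @abstractmethod
    def generate(self, prompt: str) -> str: ...

    def __enter__(self) -> "Backend":
        self.start()
        return self

    def __exit__(self, exc_type, exc, tb) -> bool:
        self.stop()   # always release the backend, even if inference failed
        return False  # do not swallow exceptions


class EchoBackend(Backend):
    """Stand-in backend for testing; it never launches a real process."""

    name = "echo"

    def __init__(self) -> None:
        self.running = False

    def start(self) -> None:
        self.running = True

    def stop(self) -> None:
        self.running = False

    def generate(self, prompt: str) -> str:
        if not self.running:
            raise RuntimeError("backend is not running")
        return f"echo: {prompt}"


if __name__ == "__main__":
    with EchoBackend() as backend:
        print(backend.generate("hello"))  # prints "echo: hello"
```

Tying start and stop to a context manager is one way to guarantee that a backend is shut down after a failed request; callers only ever see the shared interface.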

2. Arch-Analyzer (Python)

Arch-Analyzer is an integrated code analysis tool that uses the platform's LLM backends to analyze the structure of codebases in multiple languages (Rust, Python, Nix, TypeScript).

📋 Inference Pipeline

The platform provides a suite of scripts for the full "bootstrap, pull, run, export" lifecycle:

| Script | Role |
| --- | --- |
| `bootstrap.sh` | Build llama.cpp with CUDA, auto-detect GPU arch (sm_XX). |
| `server.sh` | Manage server lifecycle (start/stop/restart/status). |
| `model-pull.sh` | Download GGUF models from HuggingFace. |
| `run.sh` | Full pipeline: auto-start server → infer → export → auto-stop. |
| `export.sh` | Convert JSON results to MD, Text, JSONL, or CSV. |
| `health.sh` | GPU + server status + latency benchmarks. |
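
As an illustration of the export step, the sketch below converts a list of result records to JSONL and CSV. The schema that `run.sh` actually writes is not documented here, so the `prompt` and `response` keys and the `to_jsonl` and `to_csv` helpers are assumptions, not the behavior of `export.sh`.

```python
import csv
import io
import json


def to_jsonl(records: list[dict]) -> str:
    """Serialize each record as one JSON object per line."""
    return "".join(json.dumps(r, ensure_ascii=False) + "\n" for r in records)


def to_csv(records: list[dict], fields: list[str]) -> str:
    """Write a header row plus one row per record.

    Keys missing from a record become empty cells; keys not listed in
    `fields` are dropped.
    """
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fields, extrasaction="ignore", lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buf.getvalue()


if __name__ == "__main__":
    # Hypothetical result records; the real JSON layout may differ.
    results = [
        {"prompt": "hi", "response": "hello, world"},
        {"prompt": "2+2", "response": "4", "latency_ms": 12},
    ]
    print(to_jsonl(results), end="")
    print(to_csv(results, ["prompt", "response"]), end="")
```

Using `csv.DictWriter` rather than joining strings by hand means commas and quotes inside model output are escaped correctly.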

🚀 Quick Start

Nix (Recommended)

# Enter development shell
nix develop

# Build the project
nix build

# Run the API
nix run

Bare Metal / Pipeline

# 1. Build backend (llama.cpp)
./tensorforge/scripts/bootstrap.sh

# 2. Pull a model
./tensorforge/scripts/model-pull.sh --model qwen2.5-coder-32b

# 3. Run a one-shot inference pipeline
./tensorforge/scripts/run.sh --prompt "Explain quantum computing" --output result.json
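
If a llama.cpp server is already running, a prompt can also be sent to it directly over HTTP. The sketch below assumes llama.cpp's `/completion` endpoint (JSON body with `prompt` and `n_predict`, response with a `content` field) and the default address listed under Configuration. `build_completion_request` and `complete` are hypothetical helper names, not part of this project.

```python
import json
import urllib.request

DEFAULT_URL = "http://127.0.0.1:8080"  # documented default for LLAMACPP_URL


def build_completion_request(
    prompt: str, n_predict: int = 128, base_url: str = DEFAULT_URL
) -> urllib.request.Request:
    """Build (but do not send) a POST request for the /completion endpoint."""
    body = json.dumps({"prompt": prompt, "n_predict": n_predict}).encode("utf-8")
    return urllib.request.Request(
        base_url.rstrip("/") + "/completion",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt: str, base_url: str = DEFAULT_URL, timeout: float = 60.0) -> str:
    """Send the prompt and return the generated text; needs a live server."""
    request = build_completion_request(prompt, base_url=base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)["content"]


if __name__ == "__main__":
    req = build_completion_request("Explain quantum computing")
    print(req.get_method(), req.full_url)  # prints "POST http://127.0.0.1:8080/completion"
```

Building the request separately from sending it keeps the payload easy to inspect and test without a GPU or a running server.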

Docker

# Start infrastructure (PostgreSQL & MLflow)
docker-compose up -d

⚙️ Configuration

| Variable | Default | Description |
| --- | --- | --- |
| `LLAMACPP_URL` | `http://127.0.0.1:8080` | llama.cpp server endpoint |
| `LLM_PARALLEL` | `8` | Concurrent inference slots |
| `TF_MODELS` | `/var/lib/tensorforge/models/gguf` | Model storage path |
| `HF_TOKEN` | (unset) | HuggingFace access token; required for private models |
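
For tools that sit alongside the platform, the same variables can be read with the documented defaults. This is a minimal sketch: the `Settings` class and `load_settings` function are hypothetical names, and the validation of `LLM_PARALLEL` is an assumption rather than documented behavior.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    llamacpp_url: str
    llm_parallel: int
    tf_models: str
    hf_token: Optional[str]  # None when HF_TOKEN is unset or empty


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read the documented variables, falling back to their defaults."""
    parallel = int(env.get("LLM_PARALLEL", "8"))
    if parallel < 1:
        raise ValueError("LLM_PARALLEL must be a positive integer")
    return Settings(
        llamacpp_url=env.get("LLAMACPP_URL", "http://127.0.0.1:8080"),
        llm_parallel=parallel,
        tf_models=env.get("TF_MODELS", "/var/lib/tensorforge/models/gguf"),
        hf_token=env.get("HF_TOKEN") or None,
    )


if __name__ == "__main__":
    print(load_settings())
```

Passing the environment as a mapping makes the loader easy to exercise with a plain dictionary instead of mutating `os.environ`.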

💎 B200 Blackwell Optimization

Declarative configuration via NixOS module:

services.tensorforge.b200Optimized = {
  enable = true;
  vllm.enable = true;       # tensor-parallel-size=4
  llamacpp.enable = true;   # --flash-attn, ctx=32768
  monitoring.enable = true; # Prometheus + Grafana
};
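
To see why FP8 and tensor parallelism matter for sizing, the weight memory per GPU can be estimated with simple arithmetic. This is a rough sketch that assumes weights shard evenly across GPUs; it ignores the KV cache, activations, and runtime overhead, and the `weight_gib_per_gpu` helper is a hypothetical name.

```python
def weight_gib_per_gpu(params_billion: float, bytes_per_param: float, tp_size: int) -> float:
    """Estimate model weight memory per GPU in GiB under tensor parallelism."""
    if tp_size < 1:
        raise ValueError("tp_size must be at least 1")
    total_bytes = params_billion * 1e9 * bytes_per_param
    return total_bytes / tp_size / 2**30


if __name__ == "__main__":
    # 70B parameters, FP8 (1 byte each), tensor-parallel-size=4
    print(f"{weight_gib_per_gpu(70, 1, 4):.1f} GiB")  # prints "16.3 GiB"
    # Same model in FP16 (2 bytes each) on a single GPU
    print(f"{weight_gib_per_gpu(70, 2, 1):.1f} GiB")  # prints "130.4 GiB"
```

Halving the bytes per parameter or doubling the tensor-parallel size each halves the per-GPU weight footprint, which leaves more memory for context and concurrent slots.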

Maintained by: VoidNxSEC Team

GPU Program: NVIDIA Inception (B200/B300 fleet)
