Skip to main content

tensorforge β€” Getting Started Guide

GPU inference pipeline: llama.cpp + arch-analyzer + batch inference
For NVIDIA Inception (B200/B300), Shadeform, NIM containers, and NixOS


Choose your path​

EnvironmentScriptTime
Bare Linux (Ubuntu/Debian)entrypoint.sh setup~10 min
Cloud GPU (Shadeform, Lambda, RunPod)entrypoint.sh setup --cuda-arch XX~10 min
NVIDIA NIM / K8s containerbootstrap.sh --skip-deps --cuda-arch XX~8 min
NixOS (production)bootstrap-nix.sh nixos --gpu b200~5 min
Nix (dev shell, no install)bootstrap-nix.sh quick~2 min

Path 1 β€” Bare Linux (Ubuntu/Debian)​

Full setup in one command:

git clone git@github.com:VoidNxSEC/ml-ops-api.git
cd ml-ops-api

# Everything: deps + build + model + nice shell
./tensorforge/scripts/entrypoint.sh setup --model qwen2.5-coder-32b
./tensorforge/scripts/setup-shell.sh
exec zsh

Then:

./tensorforge/scripts/entrypoint.sh run \
--prompt "analyze this Rust codebase for security issues" \
--output result.json

Path 2 β€” Cloud GPU (Shadeform / Lambda / RunPod)​

git clone git@github.com:VoidNxSEC/ml-ops-api.git
cd ml-ops-api/tensorforge/scripts

# 1. Install system packages
sudo ./install-deps.sh

# 2. Build llama.cpp with CUDA
# GPU arch reference:
# L40S / L40 / RTX 4090 β†’ 89
# H100 / H200 / A100 β†’ 90 (H100=90, A100=80)
# B200 / B300 β†’ 100
# RTX 3090 / A6000 β†’ 86
# RTX 3080 / A40 β†’ 86
# T4 β†’ 75
sudo ./bootstrap.sh --cuda-arch 89 # L40S example

# 3. Nice shell
./setup-shell.sh && exec zsh

# 4. Download model
./model-pull.sh --model qwen2.5-coder-32b

# 5. Run
./server.sh start
./infer.sh "Hello from L40S"

If nvidia-smi fails (driver/library mismatch after toolkit install)​

# Try to fix without reboot
sudo rmmod nvidia_uvm && sudo modprobe nvidia_uvm
nvidia-smi

# If still broken β€” bypass nvidia-smi entirely
sudo ./bootstrap.sh --skip-deps --force-rebuild --cuda-arch 89

Path 3 β€” NVIDIA NIM / K8s Container​

NIM containers run as non-root. The scripts handle this automatically:

  • PREFIX resolves to ~/.tensorforge/llamacpp
  • sudo used only when available
  • mlock/no-mmap default to false (no CAP_IPC_LOCK in containers)
git clone git@github.com:VoidNxSEC/ml-ops-api.git
cd ml-ops-api/tensorforge/scripts

./bootstrap.sh --cuda-arch 100 # B200 = sm_100
./model-pull.sh --model llama-3.3-70b
./server.sh start

Batch inference fleet pattern ("entrar, inferir, exportar, fechar"):

# Prepare prompts
ls prompts/*.txt

# Run entire batch β€” server starts once, processes all, stops
./run.sh --batch prompts/ --output-dir results/

# Export all to markdown
./export.sh --dir results/ --format markdown --out-dir reports/

Path 4 β€” NixOS (Production)​

# Dry-run first to preview changes
sudo ./tensorforge/scripts/bootstrap-nix.sh nixos --gpu b200 --dry-run

# Apply
sudo ./tensorforge/scripts/bootstrap-nix.sh nixos --gpu b200

# Or for L40S:
sudo ./tensorforge/scripts/bootstrap-nix.sh nixos --gpu l40s --port 8080

This generates /etc/nixos/tensorforge.nix with the appropriate service config. Add to your configuration.nix:

imports = [ ./tensorforge.nix ];

Then rebuild:

sudo nixos-rebuild switch
systemctl status llamacpp-turbo
journalctl -fu llamacpp-turbo

B200 optimized module (direct)​

# /etc/nixos/configuration.nix
{
imports = [ ./modules/tensorforge/b200-optimized/default.nix ];

services.tensorforge.b200Optimized = {
enable = true;
llamacpp.enable = true; # -ngl 999, --flash-attn, ctx=65536
vllm.enable = false;
systemTuning.enable = true;
monitoring.enable = true; # Prometheus + Grafana
};
}

Path 5 β€” Nix Dev Shell (no install)​

Try llama.cpp without installing anything permanently:

./tensorforge/scripts/bootstrap-nix.sh quick
# β†’ drops into nix shell with llama-server available

Full dev environment (Rust + Python + llama.cpp):

./tensorforge/scripts/bootstrap-nix.sh dev
# β†’ nix develop from ml-ops-api/flake.nix

Script Reference​

tensorforge/scripts/
β”‚
β”œβ”€β”€ entrypoint.sh ← SINGLE ENTRYPOINT β€” use this
β”‚ commands:
β”‚ setup install deps + build + optional model pull
β”‚ run full pipeline (start β†’ infer β†’ stop)
β”‚ analyze arch-analyzer against project or platform
β”‚ server start/stop/restart/status/logs/gpu
β”‚ health GPU + server + benchmark
β”‚ status quick overview
β”‚ shell setup zsh + starship + aliases
β”‚
β”œβ”€β”€ bootstrap.sh bare Linux β€” build llama.cpp from source
β”‚ --cuda-arch XX bypass nvidia-smi (mismatch workaround)
β”‚ --skip-deps skip apt install
β”‚ --force-rebuild clean rebuild
β”‚
β”œβ”€β”€ bootstrap-nix.sh Nix / NixOS bootstrap
β”‚ quick nix shell with llama-server
β”‚ dev nix develop (full dev env)
β”‚ nixos install NixOS module
β”‚ install-nix install Nix via Determinate Systems
β”‚
β”œβ”€β”€ install-deps.sh all apt packages + Python venv
β”‚ --dry-run preview without installing
β”‚ --no-python skip Python packages
β”‚
β”œβ”€β”€ setup-shell.sh zsh + starship + fzf + bat + eza + aliases
β”‚ --no-starship skip starship prompt
β”‚
β”œβ”€β”€ server.sh manage llama-server process
β”œβ”€β”€ model-pull.sh download GGUF models from HuggingFace
β”œβ”€β”€ infer.sh single inference call
β”œβ”€β”€ run.sh full pipeline (used by entrypoint)
β”œβ”€β”€ export.sh JSON β†’ text/markdown/jsonl/csv
└── health.sh health check + benchmark

Model Registry​

NameVRAMBest For
qwen2.5-coder-7b8 GBFast code analysis
qwen2.5-coder-32b35 GBarch-analyzer default
mistral-24b26 GBBalanced
deepseek-r1-14b16 GBReasoning, fits 24 GB
llama-3.1-8b9 GBFast general purpose
llama-3.3-70b75 GBBest instruction following
deepseek-r1-70b75 GBStrong reasoning, B200
./tensorforge/scripts/model-pull.sh --list
./tensorforge/scripts/model-pull.sh --model qwen2.5-coder-32b
HF_TOKEN=hf_... ./tensorforge/scripts/model-pull.sh --model llama-3.3-70b

GPU Arch Reference​

GPUCUDA ArchFlag
NVIDIA B200 / B300sm_100--cuda-arch 100
NVIDIA H100 / H200sm_90--cuda-arch 90
NVIDIA A100sm_80--cuda-arch 80
NVIDIA L40S / L40 / RTX 4090sm_89--cuda-arch 89
NVIDIA RTX 3090 / A6000sm_86--cuda-arch 86
NVIDIA T4sm_75--cuda-arch 75

Environment Variables​

VarDefaultDescription
TF_PREFIX~/.tensorforge/llamacpp (non-root)Install prefix
TF_MODELS$TF_PREFIX/models/ggufGGUF model directory
LLAMACPP_URLhttp://127.0.0.1:8080Server endpoint
LLM_PARALLEL8Concurrent inference slots
LLM_TIMEOUT180Request timeout (seconds)
HF_TOKENβ€”HuggingFace token (private models)
MLOCKfalseLock model in RAM (bare-metal only)
NO_MMAPfalseDisable mmap (bare-metal only)
N_GPU_LAYERS999GPU layers (-1 = all)
CTX_SIZEautoContext window (auto = VRAM-based)

Platform Integration​

neoland control plane ──HTTP──► tensorforge (port 8080)
arch-analyzer ──HTTP──► LLAMACPP_URL=http://127.0.0.1:8080
cerebro RAG ──HTTP──► tensorforge

Run full platform analysis (all ~/master/ projects):

./tensorforge/scripts/entrypoint.sh analyze --platform
./tensorforge/scripts/entrypoint.sh analyze --platform --model llama-3.3-70b

# Single project
./tensorforge/scripts/entrypoint.sh analyze \
--project ~/master/neoland \
--template rust

Troubleshooting​

nvidia-smi: Failed to initialize NVML: Driver/library version mismatch​

Happens after installing cuda-toolkit without rebooting.

# Option 1: reload module
sudo rmmod nvidia_uvm && sudo modprobe nvidia_uvm
nvidia-smi

# Option 2: bypass detection (spot instance β€” can't reboot)
./bootstrap.sh --skip-deps --force-rebuild --cuda-arch 89

llama-server binary not found after build​

Usually caused by building with LLAMA_BUILD_EXAMPLES=OFF (fixed in current version). Force rebuild:

./bootstrap.sh --force-rebuild

mlock: Operation not permitted (containers)​

# Already defaulted to false. Explicitly:
MLOCK=false ./server.sh start

Server not responding after start​

./server.sh logs # check what's happening
./health.sh # GPU + server + benchmark
./server.sh status # PID + slots + model info

Maintained by: VoidNxSEC Team
GPU Program: NVIDIA Inception (B200/B300 fleet)
Cloud GPU: Shadeform / Lambda Labs / RunPod