◆ Category · 19 assets

Local LLM

Browse 19 Local LLM modes for AI coding agents — production-grounded, cited, installable. Part of the VIBE library.

mode

exllama-awq-gptq-expert-mode

Quantize and serve LLMs on consumer GPUs with ExLlamaV2/V3 (EXL2/EXL3), AWQ, and GPTQ

mode

gguf-quantization-expert-mode

Convert HF safetensors to GGUF, run llama-imatrix, choose K-quants vs IQ-quants, and quantize models for llama.cpp

mode

jan-ai-expert-mode

Use the Jan.ai open-source desktop assistant as a local LLM hub, an OpenAI-compatible server on port 1337, and an MCP host

mode

litellm-proxy-expert-mode

Run LiteLLM as a unified gateway over local + cloud LLMs with router config, virtual keys, budgets, fallbacks, and Redis caching

mode

llama-cpp-expert-mode

Build, run, and tune llama.cpp for local LLM inference across CUDA, ROCm, Metal, Vulkan, and SYCL

mode

llama-cpp-server-expert-mode

Run llama.cpp's HTTP server with OpenAI-compatible endpoints, slots, multimodal, and reverse proxies

mode

llamafile-expert-mode

Build and run Mozilla llamafile single-file LLM executables with Cosmopolitan Libc / APE

mode

lm-studio-expert-mode

Run LM Studio with the lms CLI, headless llmster daemon, REST API, and MLX backend on Apple Silicon

mode

local-agent-runtime-expert-mode

Wire local-only agentic stacks — Continue.dev, Cline, Aider, Open Interpreter, Goose — to Ollama, LM Studio, llama-server, and Jan

mode

local-rag-stack-expert-mode

Build end-to-end local RAG with Chroma/LanceDB/Qdrant + nomic-embed/bge-m3/FastEmbed + llama-cpp-server or Ollama, all in Docker Compose

mode

localai-expert-mode

Self-host LocalAI (mudler) as an OpenAI/Anthropic/ElevenLabs drop-in for LLMs, vision, audio, image, and embeddings on any hardware

mode

mlx-apple-silicon-expert-mode

Run, quantize, fine-tune (LoRA/QLoRA), and serve LLMs and VLMs natively on Apple Silicon with MLX and mlx-lm

mode

ollama-docker-deploy-expert-mode

Self-host Ollama in production with Docker/Compose, including GPU passthrough, model preload, reverse proxy auth, and multi-GPU

mode

sglang-expert-mode

Serve LLMs with SGLang's RadixAttention, structured outputs (compressed FSM), tensor parallel, DP-attention, and PD disaggregation

mode

slm-deployment-expert-mode

Pick, quantize, and deploy sub-7B SLMs (Phi-4-mini, Qwen3 0.6-4B, Gemma 3 1B/4B, Llama 3.2 1B/3B, SmolLM3) to edge and constrained hardware

mode

text-generation-webui-expert-mode

Run oobabooga's textgen with multiple backends, OpenAI/Anthropic-compatible API, characters, training tab, and portable installer

mode

tgi-huggingface-expert-mode

Deploy Hugging Face TGI in Docker with sharding, AWQ/GPTQ/EETQ/bitsandbytes quantization, and the OpenAI-compatible Messages API

mode

vllm-local-deploy-expert-mode

Self-host vLLM in Docker for high-throughput local inference with tensor parallelism, prefix caching, and AWQ/GPTQ quantization

mode

whisper-cpp-expert-mode

Run whisper.cpp for local speech-to-text — model selection, CLI, HTTP server, real-time streaming, language detection
