Local LLM
Browse 19 Local LLM modes for AI coding agents — production-grounded, cited, installable. Part of the VIBE library.
exllama-awq-gptq-expert-mode
Quantize and serve LLMs on consumer GPUs with ExLlamaV2/V3 (EXL2/EXL3), AWQ, and GPTQ
gguf-quantization-expert-mode
Convert HF safetensors to GGUF, run llama-imatrix, choose K-quants vs IQ-quants, and quantize models for llama.cpp
jan-ai-expert-mode
Use the Jan.ai open-source desktop assistant as a local LLM hub, OpenAI-compatible server on port 1337, and MCP host
litellm-proxy-expert-mode
Run LiteLLM as a unified gateway over local + cloud LLMs with router config, virtual keys, budgets, fallbacks, and Redis caching
llama-cpp-expert-mode
Build, run, and tune llama.cpp for local LLM inference across CUDA, ROCm, Metal, Vulkan, and SYCL
llama-cpp-server-expert-mode
Run llama.cpp's HTTP server with OpenAI-compatible endpoints, slots, multimodal, and reverse proxies
llamafile-expert-mode
Build and run Mozilla llamafile single-file LLM executables with Cosmopolitan Libc / APE
lm-studio-expert-mode
Run LM Studio with the lms CLI, headless llmster daemon, REST API, and MLX backend on Apple Silicon
local-agent-runtime-expert-mode
Wire local-only agentic stacks — Continue.dev, Cline, Aider, Open Interpreter, Goose — to Ollama, LM Studio, llama-server, and Jan
local-rag-stack-expert-mode
Build end-to-end local RAG with Chroma/LanceDB/Qdrant + nomic-embed/bge-m3/FastEmbed + llama-cpp-server or Ollama, all in Docker Compose
localai-expert-mode
Self-host LocalAI (mudler) as an OpenAI/Anthropic/ElevenLabs drop-in for LLMs, vision, audio, image and embeddings on any hardware
mlx-apple-silicon-expert-mode
Run, quantize, fine-tune (LoRA/QLoRA), and serve LLMs and VLMs natively on Apple Silicon with MLX and mlx-lm
ollama-docker-deploy-expert-mode
Self-host Ollama in production with Docker/Compose: GPU passthrough, model preload, reverse proxy auth, and multi-GPU
sglang-expert-mode
Serve LLMs with SGLang's RadixAttention, structured outputs (compressed FSM), tensor parallel, DP-attention, and PD disaggregation
slm-deployment-expert-mode
Pick, quantize, and deploy sub-7B SLMs (Phi-4-mini, Qwen3 0.6-4B, Gemma 3 1B/4B, Llama 3.2 1B/3B, SmolLM3) to edge and constrained hardware
text-generation-webui-expert-mode
Run oobabooga's textgen with multiple backends, OpenAI/Anthropic-compatible API, characters, training tab, and portable installer
tgi-huggingface-expert-mode
Deploy HuggingFace TGI in Docker with sharding, AWQ/GPTQ/EETQ/bitsandbytes quantization, and the OpenAI-compatible Messages API
vllm-local-deploy-expert-mode
Self-host vLLM in Docker for high-throughput local inference with tensor parallelism, prefix caching, and AWQ/GPTQ quantization
whisper-cpp-expert-mode
Run whisper.cpp for local speech-to-text — model selection, CLI, HTTP server, real-time streaming, language detection