Multimodal AI
Browse 19 Multimodal AI modes for AI coding agents — production-grounded, cited, installable. Part of the VIBE library.
animatediff-svd-expert-mode
AnimateDiff motion modules + SVD image-to-video, frame interpolation, video LoRAs
cog-video-expert-mode
CogVideoX, Mochi-1, Hunyuan, LTX video diffusion - training and inference patterns
comfyui-api-expert-mode
ComfyUI as backend - API mode, websocket polling, queue management for production
comfyui-expert-mode
ComfyUI graph design, custom nodes, workflow JSON, queue, API integration
controlnet-expert-mode
ControlNet variants - canny, depth, openpose, lineart, tile, inpaint - and multi-controlnet stacking
diffusers-library-expert-mode
HF diffusers - pipelines, schedulers, IP-Adapter loading, LoRA loading, custom model loading
elevenlabs-expert-mode
ElevenLabs TTS, voice cloning, conversational AI, sound effects, music
fal-ai-expert-mode
fal.ai serverless inference for image/video models - queue + webhook patterns
flux-expert-mode
Black Forest Labs FLUX.1 image generation - dev/schnell/pro, ControlNet, LoRA training (ai-toolkit, simpletuner)
ip-adapter-expert-mode
IP-Adapter for image-conditioned generation - plus, face ID, full-face, instant-style
multimodal-embedding-expert-mode
Multimodal embeddings - jina-clip-v2, voyage-multimodal-3, ColPali, nomic-embed-multimodal
ocr-vlm-expert-mode
OCR with VLMs - Mistral OCR, Surya, GOT-OCR2.0 - and PDF parsing pipelines (Marker, Docling, Unstructured)
sd3-expert-mode
SD3 / SD3.5 Large, MMDiT architecture, T5-XXL prompting, differences from SDXL
sdxl-expert-mode
Stable Diffusion XL - base + refiner, LoRA, IP-Adapter, samplers, schedulers
suno-udio-music-expert-mode
AI music gen patterns - Suno, Udio, Stable Audio, MusicGen, ACE-Step, YuE
video-vlm-expert-mode
Video understanding with VLMs - Qwen2.5-VL video, Apollo, LLaVA-OneVision, frame sampling
vision-llm-expert-mode
VLM landscape - Claude, GPT-4o, Llama 3.2 Vision, Qwen2.5-VL, Pixtral, MiniCPM-V, InternVL
whisper-expert-mode
Whisper variants - large-v3, faster-whisper, distil-whisper, whisper-cpp - VAD, diarization, real-time
xtts-coqui-expert-mode
Self-hosted voice cloning - XTTS-v2, Coqui TTS, F5-TTS, StyleTTS2, Kokoro, Chatterbox