◆ Category · 19 assets

Multimodal AI

Browse 19 Multimodal AI modes for AI coding agents: production-grounded, cited, and installable. Part of the VIBE library.

mode

animatediff-svd-expert-mode

AnimateDiff motion modules + SVD image-to-video, frame interpolation, video LoRAs

mode

cog-video-expert-mode

CogVideoX, Mochi-1, Hunyuan, LTX video diffusion - training and inference patterns

mode

comfyui-api-expert-mode

ComfyUI as a backend - API mode, WebSocket status updates, queue management for production

mode

comfyui-expert-mode

ComfyUI graph design, custom nodes, workflow JSON, queue, API integration

mode

controlnet-expert-mode

ControlNet variants - canny, depth, openpose, lineart, tile, inpaint - and multi-ControlNet stacking

mode

diffusers-library-expert-mode

Hugging Face Diffusers - pipelines, schedulers, IP-Adapter loading, LoRA loading, custom model loading

mode

elevenlabs-expert-mode

ElevenLabs TTS, voice cloning, conversational AI, sound effects, music

mode

fal-ai-expert-mode

fal.ai serverless inference for image/video models - queue + webhook patterns

mode

flux-expert-mode

Black Forest Labs FLUX.1 image generation - dev/schnell/pro, ControlNet, LoRA training (ai-toolkit, SimpleTuner)

mode

ip-adapter-expert-mode

IP-Adapter for image-conditioned generation - Plus, FaceID, Full Face, InstantStyle

mode

multimodal-embedding-expert-mode

Multimodal embeddings - jina-clip-v2, voyage-multimodal-3, ColPali, nomic-embed-multimodal

mode

ocr-vlm-expert-mode

OCR with VLMs - Mistral OCR, Surya, GOT-OCR2.0 - and PDF parsing pipelines (Marker, Docling, Unstructured)

mode

sd3-expert-mode

SD3 / SD3.5 Large, MMDiT architecture, T5-XXL prompting, differences from SDXL

mode

sdxl-expert-mode

Stable Diffusion XL - base + refiner, LoRA, IP-Adapter, samplers, schedulers

mode

suno-udio-music-expert-mode

AI music generation patterns - Suno, Udio, Stable Audio, MusicGen, ACE-Step, YuE

mode

video-vlm-expert-mode

Video understanding with VLMs - Qwen2.5-VL video, Apollo, LLaVA-OneVision, frame sampling

mode

vision-llm-expert-mode

VLM landscape - Claude, GPT-4o, Llama 3.2 Vision, Qwen2.5-VL, Pixtral, MiniCPM-V, InternVL

mode

whisper-expert-mode

Whisper variants - large-v3, faster-whisper, distil-whisper, whisper.cpp - VAD, diarization, real-time

mode

xtts-coqui-expert-mode

Self-hosted voice cloning - XTTS-v2, Coqui TTS, F5-TTS, StyleTTS2, Kokoro, Chatterbox
