LLM Eval Ops
Browse 18 LLM Eval Ops modes for AI coding agents — production-grounded, cited, installable. Part of the VIBE library.
arize-phoenix-expert-mode
Open-source LLM tracing and evaluation built on OpenInference and OpenTelemetry
canary-llm-deploy-expert-mode
Safe LLM deploys — canary, shadow traffic, rollback triggers, eval-gated promotion
deepeval-expert-mode
DeepEval (Confident AI) — pytest-native LLM evals with G-Eval, Hallucination, Toxicity, Bias
helicone-expert-mode
Helicone proxy/observability — cost tracking, semantic caching, rate limits, prompt versioning
langfuse-expert-mode
Self-hostable open-source LLM observability with tracing, scoring, datasets, and prompt management
langsmith-expert-mode
LangChain's hosted LLM observability and evaluation platform — traces, datasets, evaluators, hub
llm-cost-expert-mode
Token economics, prompt caching, model routing — engineering LLM apps for sustainable spend
lm-eval-harness-expert-mode
EleutherAI lm-evaluation-harness — MMLU, ARC, HellaSwag, GSM8K, IFEval, BBH benchmarks
mlflow-llm-expert-mode
MLflow Tracing for LLMs, Prompt Engineering UI, mlflow.evaluate(), prompt registry
model-card-expert-mode
Authoring HuggingFace Model Cards, NIST AI RMF / Inspect AI eval reports, transparency notes
openai-evals-expert-mode
openai/evals framework — registry layout, model-graded patterns, custom YAML evals
opentelemetry-llm-expert-mode
OpenTelemetry GenAI semantic conventions — standardized LLM spans for any vendor
prompt-management-expert-mode
Versioned prompt registries, A/B rollouts, env-aware config across Langfuse, LangSmith, Promptfoo
promptfoo-expert-mode
Promptfoo CLI for systematic prompt testing, model comparison, and red-team plugins
ragas-expert-mode
RAGAS metrics for RAG and agent evaluation — faithfulness, relevancy, context precision/recall
redteam-llm-expert-mode
Jailbreak suites, garak, PyRIT, Promptfoo red-team — adversarial testing for LLM apps
semantic-cache-expert-mode
GPTCache, Helicone cache, LangChain semantic cache — embedding-based dedup for LLM apps
wandb-prompts-expert-mode
Weights & Biases Weave — trace agents, log datasets, run evaluations, compare runs