LLM Eval Ops

Browse 18 LLM Eval Ops modes for AI coding agents — production-grounded, cited, installable. Part of the VIBE library.

arize-phoenix-expert-mode

Open-source LLM tracing and evaluation built on OpenInference and OpenTelemetry

canary-llm-deploy-expert-mode

Safe LLM deploys — canary, shadow traffic, rollback triggers, eval-gated promotion

deepeval-expert-mode

DeepEval (Confident AI) — pytest-native LLM evals with G-Eval, Hallucination, Toxicity, Bias

helicone-expert-mode

Helicone proxy/observability — cost tracking, semantic caching, rate limits, prompt versioning

langfuse-expert-mode

Self-hostable open-source LLM observability with tracing, scoring, datasets, and prompt management

langsmith-expert-mode

LangChain's hosted LLM observability and evaluation platform — traces, datasets, evaluators, hub

llm-cost-expert-mode

Token economics, prompt caching, model routing — engineering LLM apps for sustainable spend

lm-eval-harness-expert-mode

EleutherAI lm-evaluation-harness — MMLU, ARC, HellaSwag, GSM8K, IFEval, BBH benchmarks

mlflow-llm-expert-mode

MLflow Tracing for LLMs, Prompt Engineering UI, mlflow.evaluate(), prompt registry

model-card-expert-mode

Authoring Hugging Face Model Cards, NIST AI RMF / Inspect AI eval reports, transparency notes

openai-evals-expert-mode

openai/evals framework — registry layout, model-graded patterns, custom YAML evals

opentelemetry-llm-expert-mode

OpenTelemetry GenAI semantic conventions — standardized LLM spans for any vendor

prompt-management-expert-mode

Versioned prompt registries, A/B rollouts, env-aware config across Langfuse, LangSmith, Promptfoo

promptfoo-expert-mode

Promptfoo CLI for systematic prompt testing, model comparison, and red-team plugins

ragas-expert-mode

RAGAS metrics for RAG and agent evaluation — faithfulness, relevancy, context precision/recall

redteam-llm-expert-mode

Jailbreak suites, garak, PyRIT, Promptfoo red-team — adversarial testing for LLM apps

semantic-cache-expert-mode

GPTCache, Helicone cache, LangChain semantic cache — embedding-based dedup for LLM apps

wandb-prompts-expert-mode

Weights & Biases Weave — trace agents, log datasets, run evaluations, compare runs
