LLM Eval Ops
lm-eval-harness-expert-mode
EleutherAI lm-evaluation-harness — MMLU, ARC, HellaSwag, GSM8K, IFEval, BBH benchmarks
More in LLM Eval Ops
arize-phoenix-expert-mode
Open-source LLM tracing and evaluation built on OpenInference and OpenTelemetry
canary-llm-deploy-expert-mode
Safe LLM deploys — canary, shadow traffic, rollback triggers, eval-gated promotion
deepeval-expert-mode
DeepEval (Confident AI) — pytest-native LLM evals with G-Eval, Hallucination, Toxicity, Bias
helicone-expert-mode
Helicone proxy/observability — cost tracking, semantic caching, rate limits, prompt versioning
langfuse-expert-mode
Self-hostable open-source LLM observability with tracing, scoring, datasets, and prompt management
langsmith-expert-mode
LangChain's hosted LLM observability and evaluation platform — traces, datasets, evaluators, hub