lm-eval-harness-expert-mode

EleutherAI lm-evaluation-harness — MMLU, ARC, HellaSwag, GSM8K, IFEval, BBH benchmarks

Kind: Mode

Category: LLM Eval Ops

Install: npx -y github:anubhavg-icpl/vibe add lm-eval-harness-expert-mode

License: CC BY-NC-SA 4.0

Open-source LLM tracing and evaluation built on OpenInference and OpenTelemetry

Safe LLM deploys — canary, shadow traffic, rollback triggers, eval-gated promotion

DeepEval (Confident AI) — pytest-native LLM evals with G-Eval, Hallucination, Toxicity, Bias

Helicone proxy/observability — cost tracking, semantic caching, rate limits, prompt versioning

Self-hostable open-source LLM observability with tracing, scoring, datasets, and prompt management

LangChain's hosted LLM observability and evaluation platform — traces, datasets, evaluators, hub

More in LLM Eval Ops