terminal-bench-3: Evaluating Frontier Models
Explore the challenges of creating complex evaluation environments for frontier models, including successes, failures, and pitfalls to avoid.
We contributed a bunch of evaluation environments to terminal-bench-3, and I’d like to talk about the workflow of creating complex environments, the difficulties along the way (frontier models are just too good, but also not good enough), and how terminal-bench-3 approaches this crux.