The Quest for Tasks Frontier Models Don't One-Shot

Explore the challenges of creating complex evaluation environments for frontier models, discussing successes, failures, and pitfalls to avoid.

Overview

We contributed a bunch of evaluation environments to terminal-bench-3, and I’d like to talk about the workflow of creating complex environments, the difficulties along the way (frontier models are just too good, but also not good enough), and how terminal-bench-3 approaches this crux.

Tech stack