LLM evaluation is a minefield, and it turns out that agent evaluation has a bunch of additional pitfalls.