We finally have a benchmark that tests AI agents on real tax workflows.
GPT-5.4 is leading at 28%, but all models still struggle on high-stakes, multi-step tasks.
Future model cards should include benchmarks like this.