Anthropic's model achieves a 78% score on SWE-Bench, significantly outperforming GPT-5 and Opus. The model also shows unplanned cybersecurity capability, raising concerns about the potential threats posed by such models.
Mythos is fucking scary… Anthropic built a model scoring 78% on SWE-Bench.
GPT-5 gets 57%. Opus gets 53%.
The cybersecurity ability wasn't planned. It just emerged… These types of models are legitimately a threat.
So they quietly patched with AWS, Google, Microsoft, and
A new benchmark testing AI agents on real tax workflows shows GPT-5.4 leading at 28%, highlighting the challenges all models face in high-stakes, multi-step tasks. This insight could inform future model development and evaluation criteria.
We finally have a benchmark that tests AI agents on real tax workflows.
GPT-5.4 is leading at 28%, but all models still suck at high-stakes, multi-step tasks.
Future model cards should include benchmarks like this.