Anthropic's model achieves a 78% score on SWE-Bench, significantly outperforming GPT-5 and Opus. The model also shows unplanned cybersecurity capability, raising concerns about the potential threats posed by such models.
Mythos is fucking scary… Anthropic built a model scoring 78% on SWE-Bench.
GPT-5 gets 57%. Opus gets 53%.
The cybersecurity ability wasn't planned. It just emerged… These types of models are legitimately a threat.
So they quietly patched with AWS, Google, Microsoft, and
A new benchmark testing AI agents on real tax workflows shows GPT-5.4 leading at 28%, highlighting the challenges all models face in high-stakes, multi-step tasks. This insight could inform future model development and evaluation criteria.
We finally have a benchmark that tests AI agents on real tax workflows.
GPT-5.4 is leading at 28%, but all models still suck at high-stakes, multi-step tasks.
Future model cards should include benchmarks like this.