Anthropic's Mythos model scores 78% on SWE-Bench, significantly outperforming GPT-5 and Opus. Its cybersecurity capability emerged unplanned, raising concerns about the threats such models could pose.
Mythos is fucking scary… Anthropic built a model scoring 78% on SWE-Bench.
GPT-5 gets 57%. Opus gets 53%.
The cybersecurity ability wasn't planned. It just emerged… These types of models are legitimately a threat.
So they quietly patched with AWS, Google, Microsoft, and
Flowise is the fourth agent framework found shipping unsandboxed code execution, this time with a critical CVSS 10.0 vulnerability that is already being exploited in the wild. This highlights ongoing security issues in AI tooling that builders need to be aware of.
Flowise just became the fourth agent framework caught shipping unsandboxed code execution into production. This time it's CVSS 10.0, maximum severity, and VulnCheck confirms attackers are already exploiting it in the wild.
The vulnerability is almost insultingly simple.
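For context, here is a minimal sketch of the general class of flaw being described: unsandboxed execution of attacker-controlled code reachable from an HTTP endpoint. The route, payload shape, and handler below are hypothetical, invented purely for illustration; they are not Flowise's actual code or API.

```python
# Hypothetical illustration of unsandboxed code execution in a web service.
# The endpoint name and payload fields are invented for this sketch and do
# NOT correspond to the real Flowise code path behind the CVE.
from flask import Flask, request, jsonify

app = Flask(__name__)

@app.route("/api/run-tool", methods=["POST"])
def run_tool():
    payload = request.get_json(force=True)
    code = payload.get("code", "")
    # Dangerous: runs arbitrary attacker-supplied code in-process,
    # with no sandbox, no allowlist, and no authentication check.
    local_vars = {}
    exec(code, {}, local_vars)
    return jsonify({"result": local_vars.get("result")})

if __name__ == "__main__":
    app.run()
```

The general mitigation for this class of bug is to execute tool or workflow code in an isolated sandbox (a separate container, WASM runtime, or locked-down subprocess) and to require authentication on any endpoint that can reach it.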
Meta claims Muse Spark ranks in the top five globally on benchmarks while using significantly less compute than Llama 4 Maverick, challenging the notion that advanced AI requires extensive infrastructure investment. This could indicate a shift in how AI systems are built and deployed.
Meta built Muse Spark using over 10x less compute than Llama 4 Maverick.
Top-five globally on benchmarks. Fraction of the training cost.
Efficiency curves compressing this fast change the underlying assumption that frontier AI requires frontier infrastructure spend.
The labs
The tweet discusses the performance of Llama 3 and Phi-4 compared to GPT-3.5 and GPT-4o, highlighting significant efficiency and capability improvements. Senior engineers may find the benchmarks relevant for evaluating model performance and infrastructure requirements.
GPT-3.5 had 175 billion parameters.
Llama 3 matched it with 8 billion. That is 20x fewer.
Phi-4 has 14 billion parameters. It outperforms GPT-4o on math and graduate-level science benchmarks. A model that runs on a laptop beating one that needs a datacenter.
The pattern is
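A quick arithmetic check of the parameter-ratio claims above. The 175-billion figure is the widely cited GPT-3-era size; exact GPT-3.5 and GPT-4o parameter counts were never officially published, so treat these numbers as approximations.

```python
# Back-of-the-envelope check of the "20x fewer parameters" claim.
# Figures are the publicly reported approximations, not official counts.
gpt_35_params = 175e9    # widely cited GPT-3.5 parameter count
llama_3_8b_params = 8e9  # Llama 3 8B
phi_4_params = 14e9      # Phi-4

print(f"GPT-3.5 / Llama 3 8B: {gpt_35_params / llama_3_8b_params:.1f}x")  # ~21.9x, i.e. roughly 20x fewer
print(f"GPT-3.5 / Phi-4:      {gpt_35_params / phi_4_params:.1f}x")       # ~12.5x
```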
Gemini 3.1 Pro outperforms most competitors in benchmarks and ties with GPT-5.4 Pro on a key index, all at a significantly lower cost. This indicates a strong competitive position for Google in the AI landscape, which may influence future development strategies.
Gemini 3.1 Pro leads 13 of 16 major benchmarks right now. It ties GPT-5.4 Pro on the Artificial Analysis Intelligence Index. It costs roughly a third of the price. Google is winning the benchmark race and the cost race simultaneously. The discourse is still OpenAI vs Anthropic.
A new benchmark tests AI agents on real tax workflows; GPT-5.4 leads at 28%, highlighting the difficulty all models face with high-stakes, multi-step tasks. This insight could inform future model development and evaluation criteria.
We finally have a benchmark that tests AI agents on real tax workflows.
GPT-5.4 is leading at 28%, but all models still struggle on high-stakes, multi-step tasks.
Future model cards should include benchmarks like this.