Stanford research finds that leading AI models such as GPT-5 and Google Gemini keep scoring high on vision benchmarks even when the images are removed, exposing a significant flaw in AI vision systems. The finding could prompt engineers to reassess model reliability in real-world applications.
Holy shit… Stanford University just exposed a massive flaw in AI vision.
GPT-5, Google Gemini, and Claude scored 70–80% accuracy… with no images at all.
They call it the "mirage effect":
→ Researchers removed images from 6 major benchmarks
→ Models kept answering like
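The check described in the tweet (score a model on a benchmark, then score it again with the images withheld) can be sketched roughly as below. This is a minimal illustration, not the Stanford team's code: `Item`, `accuracy`, `image_ablation_gap`, and the `model(question, image)` callable are all hypothetical names, and exact-match scoring is an assumption.

```python
from typing import Callable, NamedTuple, Optional, Sequence, Tuple


class Item(NamedTuple):
    """One benchmark example (hypothetical structure, for illustration only)."""
    question: str
    image: Optional[bytes]
    answer: str


# Assumed model interface: takes a question and an optional image, returns an answer.
Model = Callable[[str, Optional[bytes]], str]


def accuracy(model: Model, items: Sequence[Item], use_images: bool = True) -> float:
    """Exact-match accuracy; when use_images is False the image is withheld."""
    if not items:
        raise ValueError("items must not be empty")
    correct = 0
    for item in items:
        image = item.image if use_images else None
        prediction = model(item.question, image)
        if prediction.strip().lower() == item.answer.strip().lower():
            correct += 1
    return correct / len(items)


def image_ablation_gap(model: Model, items: Sequence[Item]) -> Tuple[float, float, float]:
    """Return (accuracy with images, accuracy without images, difference)."""
    with_images = accuracy(model, items, use_images=True)
    without_images = accuracy(model, items, use_images=False)
    return with_images, without_images, with_images - without_images
```

A gap near zero alongside a high no-image score would suggest the benchmark can be answered from the text alone, which is the pattern the tweet calls the mirage effect.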
AI research, vision systems, Stanford, GPT-5, Google Gemini
ConvApparel is a new dataset aimed at improving LLM-based user simulators by quantifying the 'realism gap.' This could be relevant for engineers focused on enhancing conversational agent training methodologies.
Introducing ConvApparel, a new human-AI conversation dataset, as well as a comprehensive evaluation framework designed to quantify the "realism gap" in LLM-based user simulators and improve the training of robust conversational agents.
Read all about it →
goo.gle/41k5eff
The Memory Intelligence Agent (MIA) proposes an architecture that lets 7B models outperform GPT-5.4 by combining a Manager-Planner-Executor framework with continual test-time learning. This could interest engineers looking for novel strategies in AI model development.
MIA: Memory Intelligence Agent
Evolves deep research agents from passive record-keepers into active strategists, enabling 7B models to outperform GPT-5.4 via a Manager-Planner-Executor architecture with continual test-time learning.
Announcement of a research presentation on AI's role in security, specifically focusing on a project called 'HTTP Terminator.' Senior engineers may find the insights relevant for understanding AI's application in security contexts.
I'm thrilled to announce "Can AI Do Novel Security Research? Meet the HTTP Terminator" will premiere at @BlackHatEvents #BHUSA! Check out the abstract:
This tweet discusses a new method presented at NLP2026 for resolving notation variations in medical department names using an LLM, achieving a high accuracy rate. Senior engineers may find the approach and results relevant for improving NLP applications in healthcare.
Published a new article on the KAKEHASHI Tech Blog.
We presented at NLP2026 a method that resolves "notation variations" in medical department names using an LLM, achieving a 97.5% accuracy rate with GPT-5. Please take a look.
Anthropic's new research explores using a weak AI model to supervise the training of a stronger one, potentially accelerating alignment research. This could have implications for how AI systems are developed and aligned in the future.
New Anthropic Fellows research: developing an Automated Alignment Researcher.
We ran an experiment to learn whether Claude Opus 4.6 could accelerate research on a key alignment problem: using a weak AI model to supervise the training of a stronger one.
AI alignment, research, Anthropic, Claude Opus, machine learning