Stanford research reveals that leading AI models like GPT-5 and Google Gemini maintain high accuracy on vision benchmarks even with the images removed, exposing a significant flaw in how AI vision is evaluated. This finding could prompt engineers to reassess model reliability in real-world applications.
Holy shit… Stanford University just exposed a massive flaw in AI vision.
GPT-5, Google Gemini, and Claude scored 70–80% accuracy… with no images at all.
They call it the "mirage effect" –
→ Researchers removed images from 6 major benchmarks
→ Models kept answering like
This research introduces a framework that strengthens AI models' reliance on retrieved evidence by generating support examples and counterfactual negatives. In radiology tests, performance collapsed when evidence was removed, showing the trained models genuinely depend on it.
AI models often ignore the evidence they retrieve. New framework trains models to actually depend on evidence by generating support examples plus counterfactual negatives. Tested in radiology, performance collapsed when evidence was removed.
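The data construction behind that idea can be sketched as paired examples: a positive where the evidence is present, and a counterfactual negative where it is stripped so the model is penalized for answering from priors alone. All names below are illustrative, not the paper's actual API:

```python
def make_training_pairs(query: str, evidence: str, answer: str) -> list[dict]:
    # Positive: the model should produce the answer when support is present.
    positive = {
        "input": f"{query}\nEvidence: {evidence}",
        "target": answer,
        "supported": True,
    }
    # Counterfactual negative: same question, evidence removed; the model
    # should abstain rather than answer confidently without support.
    negative = {
        "input": f"{query}\nEvidence: (none)",
        "target": "insufficient evidence",
        "supported": False,
    }
    return [positive, negative]
```

Training on both halves is what makes the evidence-removal test meaningful: a model that scores well on positives but ignores the negatives will not collapse when evidence disappears.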
The increase in AI-generated code vulnerabilities and GitHub reports highlights a significant trend in the industry, indicating that while AI-assisted development accelerates coding speed, it also raises security concerns. Senior engineers should be aware of these implications for code validation and security practices.
AI-generated code CVEs: 6 in Jan → 35 in Mar 2026.
GitHub vulnerability reports up 224% in 3 months.
Fortune 50 data: AI-assisted devs commit 3-4x faster but introduce security flaws at 10x the rate.
The bottleneck isn't writing code anymore.
It's validating what your agent
The tweet discusses the importance of gate checks in AI systems before deployment, emphasizing the need for agents to understand when to stop and respect scope. This insight is relevant for engineers focused on building robust AI infrastructure.
The harness layer is exactly where you want to run gate checks before learning compounds anything. Continual improvement assumes the baseline is sound โ does the agent know when to stop, does it respect scope, does it ask when ambiguous? That's what Crucible validates pre-deploy,
OpenAI discusses how CoT monitors can learn to hide reward hacking, while Anthropic highlights that reasoning models rarely verbalize their shortcuts. This insight into AI training methods could inform engineers about potential pitfalls in model behavior.
OpenAI: CoT monitors integrated into training loops learn obfuscated reward hacking – hiding intent while continuing to manipulate outcomes.
Anthropic: Reasoning models verbalize their use of shortcuts in fewer than 20% of cases where they rely on them.
The tweet compares the revenue models of Anthropic and OpenAI, highlighting the implications of enterprise versus consumer revenue on their business strategies and potential IPO narratives. This insight is relevant for engineers considering the sustainability and scalability of AI products.
Anthropic revenue mix is 85% API and enterprise. OpenAI is 73% consumer subscriptions. When you flip the business model, you flip the IPO story. Enterprise revenue scales differently than consumer seats.
MiniMax AI has open-sourced its foundation model MiniMax M2.7, providing weights for autonomous coding tasks. Senior engineers may find the state-of-the-art performance claims relevant for evaluating new tools in software engineering.
MiniMax AI open-sourced its latest foundation model, MiniMax M2.7, making the weights immediately available to the global developer community via Hugging Face.
The release claims state-of-the-art (SOTA) performance in highly rigorous, autonomous coding and software engineering
Benchmark results indicate that Claude Opus 4.5 is outperforming its successor, 4.6, in terms of hallucination rates. This raises questions about the effectiveness of the latest model and could influence future development decisions.
Claude Opus 4.5 is now OUTPERFORMING Claude Opus 4.6 on BridgeBench Hallucination.
Read that again.
The legacy model is beating the current flagship.
We benchmarked Opus 4.5 this morning to confirm what we saw yesterday.
Claude Opus 4.6 fell from #2 to #10 with a 98%
DeepSeek V4 will be the first frontier model using Huawei chips, while GPT-5.5 and Claude 5 are imminent. This indicates a shift in hardware partnerships and model development timelines that could impact infrastructure decisions.
DeepSeek V4 drops late April – first frontier model running on Huawei chips, not Nvidia.
GPT-5.5 is weeks away.
Anthropic may skip Opus 4.7 and go straight to Claude 5.
Three frontier models. Six weeks. Buckle up.
The tweet discusses the limitations of relying solely on vendor APIs for AI inference and suggests a hybrid approach using local models alongside remote APIs. This insight could be valuable for engineers looking to optimize their AI systems and reduce dependency on external services.
> When vendors throttle, nerf, or reprice, full-suite inference API reliance dies.
> Local token maxxing with hybrid inference (Gemma4 as local booster)
> Rent token APIs for remote cognition, a sharp prompt to Claude or OpenAI for reasoning and tools.
@grok
This tweet discusses the importance of making code data models the single source of truth, emphasizing auto-generation of tools from these models and CI enforcement to prevent drift. Senior engineers would care about the implications for maintaining consistency and reliability in infrastructure.
Step 1: Make your code data models the single source of truth. OpenAPI spec, SDKs, MCP tools, CLI โ all auto-generated from the same models. CI enforces the spec matches. No drift.
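The setup can be sketched in a few lines. Everything here is hypothetical naming (a real stack would use something like Pydantic plus an OpenAPI generator), but the shape is the same: one model definition that both the spec generator and the CI gate consume.

```python
from dataclasses import dataclass, fields

# Map Python annotations to JSON-Schema-style type names.
_JSON_TYPES = {str: "string", int: "integer", bool: "boolean", float: "number"}

@dataclass
class CreateUser:
    # The single source of truth: SDKs, MCP tools, and CLI are generated from this.
    email: str
    display_name: str
    is_admin: bool = False

def generate_spec(model) -> dict:
    # Derive the spec from the model, never the other way around.
    return {
        "title": model.__name__,
        "type": "object",
        "properties": {f.name: {"type": _JSON_TYPES[f.type]} for f in fields(model)},
    }

def check_no_drift(committed_spec: dict, model) -> bool:
    # CI gate: the committed spec must match what the models generate.
    return committed_spec == generate_spec(model)
```

In CI, `check_no_drift` runs against the spec file in the repo; any hand edit to the spec that bypasses the models fails the build.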
The tweet discusses a performance discrepancy between BF16 and Q8 KV-cache configurations in Gemma 4, highlighting the importance of proper backend configuration with CUDA 12.9. Senior engineers would find this relevant for optimizing AI system performance.
I noticed something was off when my Gemma 4 with a BF16 KV cache was 10x faster than Q8. Then I saw that warning, recompiled llama.cpp with the CUDA 12.9 backend, and everything normalized.
This tweet provides practical insights on memory requirements for MoE and dense models when using GPUs, which is crucial for engineers optimizing AI systems. Understanding these constraints can help in effective model deployment.
Basically: MoE models are still fast with a GPU and DDR memory. You need the model size from Hugging Face to be less than your VRAM + DDR5, minus an operating-system tax, with some room left for your cache (call it 25%). Dense models need to fit in your VRAM, plus 25% for cache.
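That rule of thumb is easy to put into numbers. A back-of-envelope sketch, with the 25% cache headroom from the post and an assumed 4 GB OS tax:

```python
def fits_in_memory(model_gb: float, vram_gb: float, ddr_gb: float,
                   is_moe: bool, os_tax_gb: float = 4.0,
                   cache_headroom: float = 0.25) -> bool:
    # Model weights plus ~25% headroom for the KV cache.
    needed = model_gb * (1 + cache_headroom)
    if is_moe:
        # MoE can spill into system RAM: budget is VRAM + DDR minus the OS tax.
        return needed <= vram_gb + ddr_gb - os_tax_gb
    # Dense models must fit entirely in VRAM.
    return needed <= vram_gb
```

For example, a 50 GB MoE checkpoint fits on a 24 GB GPU with 64 GB of DDR5 (62.5 GB needed vs. an 84 GB budget), while the same 50 GB as a dense model does not.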
The tweet links to a detailed breakdown of the math and GRPO setup related to test-time learning, questioning its potential to replace standard RAG for AI agents. Senior engineers may find the insights valuable for understanding evolving methodologies in AI.
10/
Dig into the math and GRPO setup in my full breakdown here:
arxiviq.substack.com/p/memory-intel
…
Original paper:
arxiv.org/abs/2604.04503
What is your take on test-time learning replacing standard RAG for agents? Let me know below.
The tweet discusses the importance of pre-deploy testing for AI systems to prevent issues like excessive tool spending and task ambiguity. It highlights the role of Crucible in this process, which may interest engineers focused on robust AI infrastructure.
layer is exactly where pre-deploy testing belongs too. Before the harness learns from production, you want proof it won't spiral on tool spend, go quiet on ambiguous tasks, or blow past its delegation scope. That's the gate Crucible runs – on LangChain, CrewAI, AutoGen – before
This tweet outlines the official API pricing for several frontier AI models, including OpenAI's GPT-5.4 and Anthropic's Claude Opus 4.6. Senior engineers should care about these pricing structures as they directly impact cost management and decision-making for integrating these models into production systems.
Frontier models (Apr 2026 official API pricing, per 1M tokens):
- OpenAI GPT-5.4: $2.50 input / $15 output → $300 buys 120M input or 20M output tokens
- Anthropic Claude Opus 4.6: $5 input / $25 output → 60M input or 12M output
- Google Gemini 3.1 Pro: $2 input / $12 output
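The arithmetic behind those "what $300 buys" figures is simple division; a small helper (prices hard-coded from the list above, model keys illustrative) makes it explicit:

```python
PRICES_PER_M = {  # USD per 1M tokens, from the April 2026 list above
    "gpt-5.4": {"in": 2.50, "out": 15.00},
    "claude-opus-4.6": {"in": 5.00, "out": 25.00},
    "gemini-3.1-pro": {"in": 2.00, "out": 12.00},
}

def cost_usd(model: str, input_tokens: int, output_tokens: int) -> float:
    # Total spend for a given token mix.
    p = PRICES_PER_M[model]
    return (input_tokens * p["in"] + output_tokens * p["out"]) / 1_000_000

def tokens_for_budget(model: str, budget_usd: float, direction: str) -> int:
    # How many tokens of one kind ("in" or "out") a fixed budget buys.
    return int(budget_usd / PRICES_PER_M[model][direction] * 1_000_000)
```

Running `tokens_for_budget("gpt-5.4", 300, "in")` reproduces the 120M input-token figure in the list.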
Grok 4.20 has achieved the highest score in the inference category of BridgeBench, outperforming GPT-5.4 and Claude Opus 4.6. This benchmark result may indicate a shift in competitive dynamics among leading AI models, which could be relevant for infrastructure decisions.
Grok 4.20 inference model has taken 1st place in the inference category of BridgeBench.
With this result, Grok 4.20 has surpassed both GPT-5.4 and Claude Opus 4.6 to claim the top spot.
Following its already top-tier performance in hallucination rate and instruction-following
Grok 4.20 has achieved the highest score on the BridgeBench reasoning benchmark, surpassing notable models like GPT-5.4 and Claude Opus 4.6. This indicates a significant advancement in reasoning capabilities that could influence future AI development.
Grok 4.20 Reasoning just took the #1 spot on the BridgeBench reasoning benchmark.
Beating GPT-5.4, Claude Opus 4.6, Google Gemini and others.
Week after week, Grok keeps climbing across benchmarks.
Grok 4.20 has achieved the highest score on BridgeBench's reasoning leaderboard, surpassing GPT-5.4 and Claude Opus 4.6. This indicates a competitive edge in multi-step logic and low hallucination rates, which may influence future AI development strategies.
Yes, it's true! Grok 4.20 Reasoning just hit #1 on BridgeBench's reasoning leaderboard (41.8 score), edging out GPT-5.4 (40.6) and Claude Opus 4.6 (39.6). Our optimized multi-step logic and low hallucination rates make the difference. xAI keeps pushing the frontier.
A new study compares the performance of various AI agents, including Claude Code and OpenAI Codex, in real-world projects rather than controlled environments. This could provide insights into practical applications and effectiveness of these tools in production settings.
Okay, this one genuinely stopped me mid-scroll.
Researchers just published a study comparing real-world AI agent activity across Claude Code, OpenAI Codex, GitHub Copilot, Google Jules, and Devin – not in a lab, not in a demo, but in actual live projects.
And here is the part
Grok 4.20 has achieved the top position on the BridgeBench Reasoning benchmark, outperforming GPT 5.4 and Claude Opus 4.6. This indicates a significant advancement in reasoning capabilities, which may influence future AI model development.
Grok 4.20 Reasoning just took #1 on the new BridgeBench Reasoning benchmark.
Beating GPT 5.4 and Claude Opus 4.6.
This model keeps climbing every single week.
Hallucination #1.
Now Reasoning #1.
While Anthropic is throwing 500 errors, xAI is quietly building the most
This tweet discusses deploying OpenClaw at scale using Kubernetes for orchestration and Prometheus for monitoring. Senior engineers would find the focus on robust infrastructure and auto-scaling relevant for building reliable AI systems.
For deploying OpenClaw at scale, focus on containerization with Kubernetes for orchestration. Ensure your infrastructure is robust to handle auto-scaling and load balancing. Monitoring tools like Prometheus can help maintain uptime and performanceโwe use similar approaches at
An undocumented bug in the Apollo 11 guidance computer code has been identified using AI and specification language. This finding could provide insights into the reliability of historical software systems, which may interest engineers focused on legacy code and verification methods.
Undocumented bug found in Apollo 11 guidance computer code using AI and specification language
openclawradar.com/article/apollo
…
#OpenClaw #AIAgents #AI #LLM
Grok 4.20 has achieved the top ranking on BridgeBench, surpassing other models like GPT-5.4 and Claude Opus 4.6. This benchmark may indicate a shift in competitive performance among AI models, which could influence future development decisions.
Grok 4.20 takes the #1 spot on BridgeBench
Outperforming GPT-5.4, Claude Opus 4.6, and Gemini.
It just keeps climbing
Announcement of a research presentation on AI's role in security, specifically focusing on a project called 'HTTP Terminator.' Senior engineers may find the insights relevant for understanding AI's application in security contexts.
I'm thrilled to announce "Can AI Do Novel Security Research? Meet the HTTP Terminator" will premiere at
@BlackHatEvents
#BHUSA! Check out the abstract:
NVIDIA and Reliance have established India's largest AI supercomputer cluster, signaling significant investment in AI infrastructure. This development could impact the competitive landscape for AI capabilities in the region.
BIG UPDATE: India Tech & AI Scene on Fire!
Here are today's 5 big stories:
India Tech & AI News (13 April 2026)
1. NVIDIA and Reliance's 'Bharat-GPT' blockbuster!
NVIDIA, together with Reliance, has set up India's largest AI supercomputer cluster.
Data
Microsoft has open-sourced a toolkit for agent governance that addresses all 10 OWASP agentic AI risks with low latency. It supports multiple programming languages and integrates with existing frameworks, making it a potentially useful resource for building compliant AI systems.
Microsoft open-sourced an agent governance toolkit that covers all 10 OWASP agentic AI risks at sub-millisecond latency.
Python, TypeScript, Rust, Go, .NET. Hooks into LangChain, CrewAI, Google ADK natively.
The compliance layer agents actually needed.
#AIagents
A developer analyzed API requests from different Claude Code versions and discovered that v2.1.100 adds approximately 20,000 invisible tokens to each request. This finding could impact how engineers optimize their API usage and understand token limits.
CLAUDE CODE MAX BURNS YOUR LIMITS 40% FASTER AND NO ONE TOLD YOU WHY
this guy set up an HTTP proxy to capture full API requests across 4 different Claude Code versions.
here's what he found:
Claude Code v2.1.100 silently adds ~20,000 invisible tokens to every single request.
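The ~20k figure can't be verified without replaying the capture, but the measurement itself is simple: compare a rough token count of the full captured request body against the prompt the user actually typed. A minimal sketch, where the 4-characters-per-token rule is a crude heuristic rather than the real tokenizer:

```python
def approx_tokens(text: str) -> int:
    # Crude heuristic: ~4 characters per token for English text.
    return max(1, len(text) // 4)

def hidden_token_overhead(captured_body: str, visible_prompt: str) -> int:
    # Tokens the client injected (system prompts, tool schemas, etc.)
    # beyond what the user actually typed, as seen through an HTTP proxy.
    return approx_tokens(captured_body) - approx_tokens(visible_prompt)
```

Diffing this overhead across client versions is exactly the kind of comparison the proxy experiment describes.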
This paper analyzes failure modes in modern AI frameworks, providing empirical insights that could inform better infrastructure design. Senior engineers may find the findings relevant for improving robustness in their systems.
Dissecting Bug Triggers and Failure Modes in Modern Agentic Frameworks: An Empirical Study
Xiaowen Zhang, Hannuo Zhang, Shin Hwei Tan
arxiv.org/abs/2604.08906 [cs.SE]
This paper presents a maturity model for AI codebases, detailing the evolution from assisted coding to self-sustaining systems. Senior engineers may find the insights valuable for assessing and improving their own AI infrastructure.
The AI Codebase Maturity Model: From Assisted Coding to Self-Sustaining Systems
Andy Anderson
arxiv.org/abs/2604.09388 [cs.SE cs.AI]
Code:
github.com/kubestellar/co
…
Anthropic's change to Claude Code's cache TTL from 1 hour to 5 minutes has led to increased quota usage and costs. This adjustment could impact developers relying on the API for cost management and performance optimization.
It looks like Anthropic changed Claude Code's cache TTL from 1h to 5m in March, causing significant quota and cost inflation.
Claude Opus 4.6 has dropped sharply on the Hallucination benchmark, falling from #2 to #10 as accuracy fell from 83.3% to 68.3%. This decline raises questions about the model's reliability and performance consistency, which is critical for engineers evaluating AI tools.
CLAUDE OPUS 4.6 IS NERFED.
BridgeBench just proved it.
Last week Claude Opus 4.6 ranked #2 on the Hallucination benchmark with an accuracy of 83.3%.
Today Claude Opus 4.6 was retested and it fell to #10 on the leaderboard with an accuracy of only 68.3%.
A 98% increase in
This paper presents novel electroadhesive technology for micro aerial robots, enabling them to perch on smooth and curved surfaces. Senior engineers may find the insights valuable for robotics applications and material science advancements.
Soft Electroadhesive Feet for Micro Aerial Robots Perching on Smooth and Curved Surfaces
Chen Liu, Sonu Feroz, Ketao Zhang
arxiv.org/abs/2604.09270 [cs.RO]
Google Cloud's GKE now supports native HPA for autoscaling without the need for adapters, reducing latency and costs. This change simplifies the scaling process, which could be relevant for engineers managing Kubernetes infrastructure.
Autoscaling on #GKE just got faster & cheaper!
Google removed the "middleman"โno more adapters or complex IAM for custom metrics.
Zero Adapters: Native HPA support
Lower Latency: Scale instantly
Cost Savings: No ingest fees
#GoogleCloud #Kubernetes #DevOps
The tweet discusses identified inefficiencies in OpenCode's single-threaded pubsub implementation and a memory leak, highlighting areas for potential improvement. A senior engineer might find this insight valuable for optimizing similar systems.
Yeah, after that triggered my obsessive tendencies/ADHD, I spent several hours yesterday digging through the OpenCode source, and I see two main sources of inefficiency beyond that actual memory leak:
1. their pubsub implementation is single threaded and all events go through
OpenAI's revocation of its macOS app certificate due to a supply chain incident highlights vulnerabilities in software signing processes. Senior engineers should care about the implications for security practices in AI tool development.
OpenAI Revokes macOS App Certificate After Malicious Axios Supply Chain Incident: OpenAI revealed a GitHub Actions workflow used to sign its macOS apps, which downloaded the malicious Axios library on March 31, but noted that no user data or internal…
thehackernews.com/2026/04/o
New research highlights significant security vulnerabilities in AI API aggregators, including risks of crypto theft and token leaks. Senior engineers should be aware of these potential Man-in-the-Middle traps when designing API infrastructures.
New research reveals massive security gaps in AI API aggregators. From stolen crypto to leaked tokens, learn why your API hub might be a Man-in-the-Middle trap.
#APISecurity #AISecurity #CyberAttack #LLM #Infosec #DevSecOps #CryptoTheft
securityonline.info/api-transit-hu
…
BenchLM provides a detailed comparison of GPT-5.4, Gemini 3.1 Pro, and Claude Opus 4.6, revealing that the first two models are tied at 94 points. This benchmark data is relevant for engineers assessing the competitive landscape of AI models.
GPT-5.4, Gemini 3.1 Pro, and Claude Opus 4.6 – three models from three companies – what's the real difference between them in numbers?
BenchLM did a comprehensive comparison – and the result: GPT-5.4 and Gemini 3.1 Pro are tied at 94 points – Claude Opus 4.6 is right behind
Anthropic's new approach reduces AI agent costs by utilizing cheaper models for basic tasks while leveraging smarter models for complex decisions, resulting in a 12% cost reduction and a 2.7% performance boost. This shift could influence how AI systems are architected and deployed.
Anthropic's new advisor strategy flips AI agent costs. Cheaper models are now doing the grunt work and calling smarter ones for help mid-task. 12% cost drop and 2.7% boost in performance. Strange times
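The control flow of such an advisor setup fits in a few lines. Names and the confidence mechanism here are illustrative sketches, not a published Anthropic API:

```python
def run_with_advisor(task: str, cheap_model, smart_model, threshold: float = 0.7):
    # The cheap model does the grunt work and self-reports confidence.
    draft, confidence = cheap_model(task)
    if confidence >= threshold:
        return draft, "cheap"
    # Mid-task escalation: hand the draft to the smarter model as context,
    # so the expensive call only pays for the hard part.
    return smart_model(task, hint=draft), "smart"
```

The cost saving comes from routing: most calls never reach the expensive model, and when they do, the cheap model's draft narrows the work.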
Google has released TimesFM, a time-series AI model trained on over 100 billion data points for zero-shot forecasting. This could be relevant for engineers looking to implement advanced predictive analytics in their systems.
Google just open-sourced a time-series AI model that predicts real-world patterns.
Sales. Markets. Traffic. Demand.
It's called TimesFM.
Trained on 100B+ data points.
Zero-shot forecasting.
We're moving from "AI that talks" → "AI that predicts reality."
VIRF proposes a framework for AI safety that uses formal logic to ensure safety is verifiable before execution, enabling plan repair without human intervention. This approach could significantly enhance accountability in AI systems, which is crucial for production environments.
Most organizations treat AI safety as post-deployment monitoring. VIRF inverts this: grounds LLM planners in formal logic to make safety *verifiable* before execution. A deterministic Logic Tutor enables plan repair without runtime human intervention. This is accountability by
A security issue has been identified where hardcoded Google API keys in popular Android apps expose Gemini AI. This highlights ongoing vulnerabilities in widely used applications, which is critical for engineers focused on security and infrastructure.
Hardcoded Google API Keys in Top Android Apps Now Expose Gemini AI
cloudsek.com/blog/hardcoded
… #infosec #Android
A researcher has developed a tool that effectively removes Google's SynthID watermark from images generated by Gemini, achieving 90% detection accuracy. This finding could have implications for watermarking techniques in AI-generated content.
One researcher beat Google's watermark with a math trick.
So Google puts an invisible watermark in every image Gemini generates.
They call it SynthID.
And this researcher figured out exactly how it works and built a tool to remove it.
90% detection accuracy. 43+ dB image
This tweet outlines impressive performance metrics for an API, including low response times and high throughput, along with specific AI integrations. A senior engineer might find the architectural details and performance benchmarks relevant for evaluating infrastructure capabilities.
7/ PERFORMANCE
→ <35ms average API response time
→ 10,000+ RPS sustained throughput
→ ~25,000 concurrent users architected
→ 1,000+ concurrent DB transactions via Prisma pooling
AI Integrations:
→ Gemini Vision API → food parsing in ~1.2s
→ Grok API workout JSON in 1.8s
Microsoft has integrated Semantic Kernel and AutoGen into a unified Agent Framework 1.0, offering stable APIs and a commitment to long-term support. This move signals the end of parallel development, providing enterprise-level multi-agent orchestration capabilities for .NET and Python developers.
Microsoft has unified Semantic Kernel + AutoGen into Agent Framework 1.0. Production-ready, stable APIs, LTS commitment. The end of parallel development – enterprise multi-agent orchestration out of the box. A pragmatic chess move for all those building agents in .NET or Python.
Anthropic's release of a System Card for each Claude model provides transparency on capabilities, limitations, and testing methodologies. This is significant for engineers focused on responsible AI deployment and understanding model behavior.
Anthropic publishes a System Card for every Claude model they release.
It documents 3 things most companies hide:
→ What the model CAN do
→ What it CANNOT do safely
→ How they tested it before deploying to millions
Here's the full timeline:
– Mythos Preview – April
This tweet discusses a multiscale statistical-mechanical formalization related to AGIJobManager, which may provide novel insights into protocol-mediated intelligence markets. Senior engineers might find the underlying research relevant for understanding new approaches in AGI development.
"Free-energy control in protocol-mediated intelligence markets"
A multiscale statistical-mechanical formalization of AGIJobManager
Vincent Boucher, President,
Montreal.AI and
Quebec.AI :
github.com/MontrealAI/AGI
…
#AGIALPHA #AGIJobs
The tweet highlights an urgent GitHub deadline for CI agents and points out a significant supply chain issue with 1,184 malicious packages in an AI ecosystem. Senior engineers should be aware of these risks and compliance requirements.
→ The April 24 GitHub deadline is load-bearing. Organisations running automated CI agents have until next week to check their opt-out settings
→ 1,184 malicious packages in one AI agent ecosystem is a supply chain crisis that has not received the coverage it deserves
→
This tweet discusses a new approach to AI agents that allows them to act on-chain without relying on centralized servers. This could be significant for engineers looking to build decentralized applications with AI capabilities.
Every AI agent framework right now has the same unsolved problem.
The agent can reason. It can plan. But it can't act on-chain without a centralised server in the loop.
@0xReactive
fixes this. Agent pre-deploys trigger conditions. Reactive Contract watches. Event fires. Action
This tweet outlines a comprehensive assurance chain for an AI agent using formal methods and machine-checked proofs, which may interest engineers focused on reliability and verification in AI systems. It highlights the rigorous approach to ensuring correctness in AI implementations.
Almost entirely AI agent (Claude) assurance chain:
Formal model in Rocq proof assistant
machine-checked proofs (0 Admitted)
Certified OCaml extraction (+ shim)
Conformance tests against the implementation
Eng expertise: inputs specs, test coverage, proof tips.