A comparison of four AI models on a hard theorem-proving task reveals significant performance differences, with Grok Expert leading. This insight into model capabilities could inform future development and benchmarking efforts.
Gave 4 AI models a hard new theorem to prove. Rankings:
1. Grok Expert - quick and elegant proof.
2. Gemini Pro - close runner-up.
3. ChatGPT Pro - claimed the theorem was incorrect and offered no proof.
4. Claude Opus - gave up after some time with no output (is it really nerfed?)
AI models · theorem proving · benchmarking · Grok Expert · ChatGPT
This tweet describes a comparative study of four types of human experience data in a generative AI workflow, offering insight into user interaction and experience design. Senior engineers may find the methodology and findings relevant for improving AI system design.
We compare 4 types of human experience data in a GAI workflow:
C1: demographics
C2: gaze (eye-tracking)
C3: questionnaire-based experience
C4: AI-predicted experience
12 designers + 30 evaluators
(4/)