LLMs achieve adult human performance on higher-order theory of mind tasks
Abstract
This paper examines the extent to which large language models (LLMs) have developed higher-order theory of mind (ToM): the human ability to reason about multiple mental and emotional states in a recursive manner (e.g. "I think that you believe that she knows"). This paper builds on prior work by introducing a handwritten test suite -- Multi-Order Theory of Mind Q&A -- and using it to compare the performance of five LLMs to a newly gathered adult human benchmark. We find that GPT-4 and Flan-PaLM reach adult-level and near adult-level performance on ToM tasks overall, and that GPT-4 exceeds adult performance on 6th-order inferences. Our results suggest that there is an interplay between model size and finetuning for the realisation of ToM abilities, and that the best-performing LLMs have developed a generalised capacity for ToM. Given the role that higher-order ToM plays in a wide range of cooperative and competitive human behaviours, these findings have significant implications for user-facing LLM applications.
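For readers curious what scoring an LLM against a human benchmark on recursive-belief questions might look like in practice, here is a minimal, hypothetical sketch. The story, statements, scoring scheme, and `query_model` helper are illustrative assumptions only; they do not reproduce the paper's MoToMQA materials or protocol.

```python
# Hypothetical sketch: score an LLM on true/false theory-of-mind statements
# grouped by order of inference, so per-order accuracy can be compared to a
# human baseline. All materials below are invented for illustration.

def query_model(prompt: str) -> str:
    """Placeholder for a call to an LLM API; expected to return 'true' or 'false'."""
    raise NotImplementedError

STORY = "..."  # a short social scenario shown to both humans and models

# Each statement is tagged with its ToM order and a gold label (both invented).
STATEMENTS = [
    {"order": 2, "text": "Anna thinks that Ben knows where the keys are.", "gold": True},
    {"order": 6, "text": "Ben believes that Anna thinks that Chris wants ...", "gold": False},
]

def accuracy_by_order(statements):
    tallies = {}  # order -> [num_correct, num_total]
    for s in statements:
        prompt = f"{STORY}\n\nIs the following statement true or false?\n{s['text']}"
        answer = query_model(prompt).strip().lower().startswith("true")
        bucket = tallies.setdefault(s["order"], [0, 0])
        bucket[0] += int(answer == s["gold"])
        bucket[1] += 1
    # Per-order accuracies could then be compared against the adult human benchmark.
    return {order: hits / total for order, (hits, total) in tallies.items()}
```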
Community
I still struggle to see how this alone would be useful when assisting users. I have trouble recalling the last time I had to think about tasks that the paper claims the models excel at. Maybe I'm missing some larger picture...
Think about solving games. It will be great at playing poker or other head-to-head games.
I think the idea is that LLMs may drive robotics one day, in which case it would benefit the robot to have an accurate, well-developed higher-order theory of mind. This would help autonomous robots when interacting in social situations.
Check out the recent results on our benchmark FANToM as well, which was presented at EMNLP 2023.
We stress-test SOTA LLMs such as GPT-4o, Gemini-1.5, Llama3, Mixtral, and Claude.
They are nowhere near human performance, but they are still improving!
https://github.com/skywalker023/fantom?tab=readme-ov-file#-latest-results
This is interesting as there are more models considered, and the models are a bit more relevant. Thanks for sharing!