Papers
arxiv:2410.04422

Hyper-multi-step: The Truth Behind Difficult Long-context Tasks

Published on Oct 6
· Submitted by yuyijiong on Oct 9
Abstract

Long-context language models (LCLMs), characterized by their extensive context windows, are becoming increasingly popular. Meanwhile, many long-context benchmarks present challenging tasks that even the most advanced LCLMs struggle to complete. However, the underlying sources of difficulty in these tasks have seldom been studied. To bridge this gap, we conduct experiments showing that their difficulty stems primarily from two basic issues: "multi-matching retrieval," which requires the simultaneous retrieval of multiple items, and "logic-based retrieval," which necessitates logical judgment within the retrieval criterion. These two problems, while seemingly straightforward, actually exceed the capabilities of LCLMs because they are proven to be hyper-multi-step (demanding numerous steps to solve) in nature. This finding explains why LLMs struggle with more advanced long-context tasks, and it provides a more accurate perspective for rethinking solutions to them.
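To make the two task types concrete, here is a minimal sketch of how such synthetic prompts could be constructed over a key-value haystack. This is an illustration, not the paper's benchmark code; the function names and the key/value format are assumptions.

```python
import random
import string

def rand_key() -> str:
    # Hypothetical 8-letter keys; the paper's actual item format may differ.
    return "".join(random.choices(string.ascii_lowercase, k=8))

def multi_matching_prompt(n_items: int = 200, n_matches: int = 5) -> str:
    """Multi-matching retrieval: one key occurs with several values,
    and the model must return ALL of them, not just one."""
    target = rand_key()
    pairs = [(rand_key(), random.randint(0, 999)) for _ in range(n_items)]
    for pos in random.sample(range(n_items), n_matches):
        pairs[pos] = (target, random.randint(0, 999))  # plant the matches
    haystack = "\n".join(f"{k}: {v}" for k, v in pairs)
    return f"{haystack}\n\nList every value associated with the key '{target}'."

def logic_based_prompt(n_items: int = 200, threshold: int = 900) -> str:
    """Logic-based retrieval: the criterion is a logical condition on the
    value, so no literal string in the question matches the answer line."""
    pairs = [(rand_key(), random.randint(0, threshold - 1)) for _ in range(n_items)]
    pairs[random.randrange(n_items)] = (rand_key(), random.randint(threshold, 999))
    haystack = "\n".join(f"{k}: {v}" for k, v in pairs)
    return f"{haystack}\n\nWhich key has a value greater than {threshold}?"
```

In the multi-matching case any single lookup is easy, but collecting all the matches forces a scan of the whole haystack; in the logic-based case the query shares no surface form with the answer line, so similarity-style retrieval offers no shortcut.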

Community

Paper author Paper submitter

This paper reveals a tough fact 🤕:
A long-context language model can never directly address advanced long-context tasks well 😵, such as repo-level code generation or filtering tabular data. This is because LLMs are inherently unable to complete a large number of reasoning steps within a limited generation length 😩, which advanced long-context tasks often require 🔢 but simple long-context tasks like needle-in-a-haystack do not 😃.
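A back-of-the-envelope way to see the step count (my reading of "hyper-multi-step", not code from the paper): an explicit program solving either task must test every item, so a faithful chain of thought needs on the order of one reasoning step per context item.

```python
def solve_multi_matching(pairs, target):
    """Explicit solution: every item must be checked, so a faithful
    chain of thought spends at least one step per item."""
    steps, matches = 0, []
    for key, value in pairs:
        steps += 1                 # one comparison = one reasoning step
        if key == target:
            matches.append(value)
    return matches, steps          # steps == len(pairs), i.e. O(N)

def solve_logic_based(pairs, threshold):
    """Same story: the logical test must be applied item by item."""
    steps = 0
    for key, value in pairs:
        steps += 1
        if value > threshold:
            return key, steps      # still O(N) in expectation
    return None, steps
```

With tens of thousands of items in the context, writing those steps out would blow past any realistic generation length, which is exactly the point above.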

When doing retrieval, LLMs are actually searching for "relevant" items, not "logically corresponding" ones.
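A toy illustration of that distinction (a hypothetical example, not from the paper): a relevance-style retriever that scores word overlap picks the line that talks about the criterion, not the line that satisfies it.

```python
def keyword_retrieve(lines, query):
    """Toy 'relevance' retriever: returns the line sharing the most
    words with the query."""
    q = set(query.lower().split())
    return max(lines, key=lambda ln: len(q & set(ln.lower().split())))

lines = [
    "alice scored 72 on the exam",
    "bob scored 95 on the exam",              # logically corresponding answer
    "carol asked which score was above 90",   # merely 'relevant'
]
print(keyword_retrieve(lines, "which student scored above 90"))
# Prints carol's line: maximal word overlap, wrong logical correspondence.
```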

A model good at both math and retrieval still cannot directly solve a math + retrieval task, unless it spends much more effort at test time 🧐.
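One concrete form that extra test-time effort could take is decomposing the long context into chunks and paying one model call per chunk instead of a single long-context call. A minimal sketch, assuming a generic chat-completion client; `ask_llm` is a hypothetical stand-in, not a real API.

```python
def ask_llm(prompt: str) -> str:
    """Hypothetical stand-in for any chat-completion API call."""
    raise NotImplementedError("wire this to your LLM client of choice")

def chunked_retrieval(items, question, chunk_size=20):
    """Trade one long-context call for many short ones, making the
    per-item scan explicit instead of hiding it in a single generation."""
    partial = []
    for start in range(0, len(items), chunk_size):
        chunk = "\n".join(items[start:start + chunk_size])
        reply = ask_llm(f"{chunk}\n\n{question} Answer 'none' if absent.")
        if reply.strip().lower() != "none":
            partial.append(reply.strip())
    return partial  # merging the partial answers is itself one short, easy call
```

The cost is roughly len(items) / chunk_size calls instead of one, which is the "much more effort at test time" being paid.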

This is a well-sequenced paper.

A minute before reading "Can these issues be further decomposed into simple solvable components?", I had the same thought: more or less, "Can direct retrieval of 100+ KVs be decomposed?"

That said, the logic and multi-matching retrieval issues are (from my perspective) different categories of mathematical problems. Would an LLM fine-tuned for math problems, like Qwen2.5-Math-1.5B/7B/72B, perform better than a non-fine-tuned counterpart such as Qwen2.5-1.5B/7B/72B, due to having been trained on mathematical problem-solving data?

One point: when I see the word "delve" (source: "To delve deeper into why LLMs struggle..."), I think, "Did an LLM edit or write this paper?" However, that is my bias as an American-English reader, and the word may simply be natural in your writing, so I do not think you need to remove it.

Thank you for your contribution to science and engineering!
