arxiv:2410.14677

Are AI Detectors Good Enough? A Survey on Quality of Datasets With Machine-Generated Texts

Published on Oct 18

· Submitted by

andriygav on Oct 21

Authors:

German Gritsai, Anastasia Voznyuk, Andrey Grabovoy

Abstract

The rapid development of autoregressive Large Language Models (LLMs) has significantly improved the quality of generated texts, necessitating reliable machine-generated text detectors. A huge number of detectors and collections with AI fragments have emerged, and several detection methods have even shown recognition quality of up to 99.9% according to the target metrics on such collections. However, the quality of such detectors tends to drop dramatically in the wild, posing a question: Are detectors actually highly trustworthy, or do their high benchmark scores come from the poor quality of evaluation datasets? In this paper, we emphasise the need for robust and qualitative methods for evaluating generated data, so that future models are secure against bias and low generalising ability. We present a systematic review of datasets from competitions dedicated to AI-generated content detection and propose methods for evaluating the quality of datasets containing AI-generated fragments. In addition, we discuss the possibility of using high-quality generated data to achieve two goals: improving the training of detection models and improving the training datasets themselves. Our contribution aims to facilitate a better understanding of the dynamics between human and machine text, which will ultimately support the integrity of information in an increasingly automated world.

Community

andriygav

Paper author Paper submitter 11 days ago

In our new paper, we demonstrate that datasets from AI-detection shared tasks and research papers are inadequate for evaluating AI detectors, resulting in systematic errors that inflate the detectors' quality scores.

surfmore

10 days ago

Which detectors are currently available for text and for code?

andriygav

Paper author Paper submitter 10 days ago

We focus on plain text. I think detecting generated code is a harder task because code lacks the coherence of natural-language text: it is closer to a list of statements that do not form a single sequence, even though together they solve one problem.

Also, detecting generated code is a less popular task at the moment: there are few datasets for it and few competitions at leading ML conferences. Even for detecting generated text, where a lot of data is available, we show that the data is not good enough to test detectors.

Plain-text detectors are very popular now, but they are all built either on BERT-like models or on the analysis of word frequencies. We mention some of the state-of-the-art detection models in our paper.
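
To make the word-frequency idea above concrete, here is a minimal sketch. It is not a method from the paper: the function names, the add-one smoothing, and the four toy sentences are all assumptions made up for illustration, and a real detector would need far more data and careful evaluation.

```python
import math
from collections import Counter


def train_word_log_odds(human_texts, machine_texts):
    """Per-word log-odds (machine vs. human) with add-one smoothing."""
    human = Counter(w for t in human_texts for w in t.lower().split())
    machine = Counter(w for t in machine_texts for w in t.lower().split())
    vocab = set(human) | set(machine)
    # Add-one smoothing: every vocabulary word gets one extra count per class.
    h_total = sum(human.values()) + len(vocab)
    m_total = sum(machine.values()) + len(vocab)
    return {
        w: math.log((machine[w] + 1) / m_total)
        - math.log((human[w] + 1) / h_total)
        for w in vocab
    }


def machine_score(text, log_odds):
    """Mean log-odds over known words; a positive value leans machine."""
    known = [log_odds[w] for w in text.lower().split() if w in log_odds]
    return sum(known) / len(known) if known else 0.0


# Toy, invented examples (hypothetical data, not taken from any dataset).
human_texts = [
    "honestly i dunno what happened there",
    "we grabbed lunch and then went home",
]
machine_texts = [
    "as an ai language model i cannot",
    "in conclusion it is important to note",
]

log_odds = train_word_log_odds(human_texts, machine_texts)
print(machine_score("it is important to note", log_odds))
print(machine_score("we went home", log_odds))
```

A scorer like this only learns which words are frequent in each class of its training collection, so if a dataset carries class-specific artifacts, a high benchmark score may reflect those artifacts and not generalise to other texts.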
