squad_precision_recall

omidf

Duplicate from evaluate-metric/squad

63e7fe5 almost 2 years ago

	---
	title: SQuAD
	emoji: 🤗
	colorFrom: blue
	colorTo: red
	sdk: gradio
	sdk_version: 3.0.2
	app_file: app.py
	pinned: false
	tags:
	- evaluate
	- metric
	description: >-
	This metric wraps the official scoring script for version 1 of the Stanford
	Question Answering Dataset (SQuAD).

	Stanford Question Answering Dataset (SQuAD) is a reading comprehension
	dataset, consisting of questions posed by crowdworkers on a set of Wikipedia
	articles, where the answer to every question is a segment of text, or span,
	from the corresponding reading passage.
	duplicated_from: evaluate-metric/squad
	---

	# Metric Card for SQuAD

	## Metric description
	This metric wraps the official scoring script for version 1 of the [Stanford Question Answering Dataset (SQuAD)](https://huggingface.co/datasets/squad).

	SQuAD is a reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage.

	## How to use

	The metric takes as input two files or two lists of question-answer dictionaries, one with the model's predictions and the other with the references they are compared against:

	```python
	from evaluate import load
	squad_metric = load("squad")
	results = squad_metric.compute(predictions=predictions, references=references)
	```
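
Each prediction carries the `id` of the reference it answers, so it can help to confirm that both lists cover the same questions before scoring. The check below is a minimal sketch: `check_ids` is a hypothetical helper written for this card, not part of `evaluate`.

```python
def check_ids(predictions, references):
    """Report ids present in one list but not the other.

    Hypothetical helper for illustration; it is not provided by `evaluate`.
    """
    pred_ids = {p["id"] for p in predictions}
    ref_ids = {r["id"] for r in references}
    return {
        "missing_predictions": sorted(ref_ids - pred_ids),
        "extra_predictions": sorted(pred_ids - ref_ids),
    }


predictions = [{"prediction_text": "1976", "id": "56e10a3be3433e1400422b22"}]
references = [
    {"answers": {"answer_start": [97], "text": ["1976"]}, "id": "56e10a3be3433e1400422b22"},
    {"answers": {"answer_start": [891], "text": ["climate change"]}, "id": "5733b5344776f419006610e1"},
]

report = check_ids(predictions, references)
print(report)
```

An empty `missing_predictions` list and an empty `extra_predictions` list mean the two inputs line up one to one.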
	## Output values

	This metric outputs a dictionary with two values: the average exact match score and the average [F1 score](https://huggingface.co/metrics/f1).

	```
	{'exact_match': 100.0, 'f1': 100.0}
	```

	The range of `exact_match` is 0-100, where 0.0 means no answers were matched and 100.0 means all answers were matched.

	The range of `f1` is 0-100 -- its lowest possible value is 0, if either the precision or the recall is 0, and its highest possible value is 100.0, which means perfect precision and recall.
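
To make the two scores concrete, here is a minimal sketch of how they can be computed, modeled on the normalization and token-overlap logic of the SQuAD v1 scoring script. The function names (`normalize_answer`, `exact_match`, `token_f1`, `squad_scores`) are chosen for this illustration and are not the `evaluate` API. Treating a question with no prediction as a score of 0 is an assumption of this sketch.

```python
import re
import string
from collections import Counter

PUNCTUATION = set(string.punctuation)


def normalize_answer(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in PUNCTUATION)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, reference):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize_answer(prediction) == normalize_answer(reference))


def token_f1(prediction, reference):
    """Harmonic mean of token-level precision and recall, in [0, 1]."""
    pred_tokens = normalize_answer(prediction).split()
    ref_tokens = normalize_answer(reference).split()
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def squad_scores(predictions, references):
    """Average the best per-question score over all references, scaled to 0-100."""
    preds = {p["id"]: p["prediction_text"] for p in predictions}
    em_total = f1_total = 0.0
    for ref in references:
        pred = preds.get(ref["id"])
        if pred is None:
            continue  # assumption: an unanswered question contributes 0
        answers = ref["answers"]["text"]
        em_total += max(exact_match(pred, a) for a in answers)
        f1_total += max(token_f1(pred, a) for a in answers)
    n = len(references)
    return {"exact_match": 100.0 * em_total / n, "f1": 100.0 * f1_total / n}
```

Under this sketch, a prediction of `"Bruno Mars"` against the reference `"Beyoncé and Bruno Mars"` has precision 1.0 and recall 0.5, so it earns partial F1 credit while its exact match is 0.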

	### Values from popular papers
	The [original SQuAD paper](https://nlp.stanford.edu/pubs/rajpurkar2016squad.pdf) reported an F1 score of 51.0% and an Exact Match score of 40.0% for its logistic regression baseline. It also reported that human performance on the dataset corresponds to an F1 score of 90.5% and an Exact Match score of 80.3%.

	For more recent model performance, see the [dataset leaderboard](https://paperswithcode.com/dataset/squad).

	## Examples

	Maximal values for both exact match and F1 (perfect match):

	```python
	from evaluate import load
	squad_metric = load("squad")
	predictions = [{'prediction_text': '1976', 'id': '56e10a3be3433e1400422b22'}]
	references = [{'answers': {'answer_start': [97], 'text': ['1976']}, 'id': '56e10a3be3433e1400422b22'}]
	results = squad_metric.compute(predictions=predictions, references=references)
	print(results)
	# {'exact_match': 100.0, 'f1': 100.0}
	```

	Minimal values for both exact match and F1 (no match):

	```python
	from evaluate import load
	squad_metric = load("squad")
	predictions = [{'prediction_text': '1999', 'id': '56e10a3be3433e1400422b22'}]
	references = [{'answers': {'answer_start': [97], 'text': ['1976']}, 'id': '56e10a3be3433e1400422b22'}]
	results = squad_metric.compute(predictions=predictions, references=references)
	print(results)
	# {'exact_match': 0.0, 'f1': 0.0}
	```

	Partial match (2 out of 3 answers correct):

	```python
	from evaluate import load
	squad_metric = load("squad")
	predictions = [{'prediction_text': '1976', 'id': '56e10a3be3433e1400422b22'}, {'prediction_text': 'Beyonce', 'id': '56d2051ce7d4791d0090260b'}, {'prediction_text': 'climate change', 'id': '5733b5344776f419006610e1'}]
	references = [{'answers': {'answer_start': [97], 'text': ['1976']}, 'id': '56e10a3be3433e1400422b22'}, {'answers': {'answer_start': [233], 'text': ['Beyoncé and Bruno Mars']}, 'id': '56d2051ce7d4791d0090260b'}, {'answers': {'answer_start': [891], 'text': ['climate change']}, 'id': '5733b5344776f419006610e1'}]
	results = squad_metric.compute(predictions=predictions, references=references)
	print(results)
	# {'exact_match': 66.66666666666667, 'f1': 66.66666666666667}
	```
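
The partial-match scores can be checked by hand. The sketch below assumes that both metrics are plain averages of per-question scores and that F1 is computed on lowercased whitespace tokens. Under that assumption the prediction `Beyonce` shares no token with `Beyoncé and Bruno Mars` (the accent makes the tokens differ), so the second question scores 0 on both metrics.

```python
# Lowercased tokens for the second question of the partial-match example.
pred_tokens = "beyonce".split()
ref_tokens = "beyoncé and bruno mars".split()
shared = set(pred_tokens) & set(ref_tokens)
print(shared)  # set()

# Questions 1 and 3 match exactly; question 2 shares no token with its reference.
per_question = [1.0, 0.0, 1.0]
average = 100.0 * sum(per_question) / len(per_question)
print(average)  # 66.66666666666667
```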

	## Limitations and bias
	This metric works only with datasets that have the same format as the [SQuAD v1 dataset](https://huggingface.co/datasets/squad).

	The SQuAD dataset does contain a certain amount of noise, such as duplicate questions as well as missing answers, but these represent a minority of the 100,000 question-answer pairs. Also, neither exact match nor F1 score reflects whether models do better on certain types of questions (e.g. "who" questions) or on those that cover a certain gender or geographical area -- carrying out more in-depth error analysis can complement these numbers.
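
One simple form of such error analysis is to break exact match down by question word. The sketch below is illustrative only. The `records` layout (with `question`, `prediction` and `answers` keys) is an assumption, since the metric's own inputs do not include the question text. The toy records are made up, and the comparison is a bare lowercase match rather than the official normalization.

```python
from collections import defaultdict


def exact_match_by_question_word(records):
    """Group a simple exact-match rate (0-100) by the first word of each question.

    `records` is a hypothetical list of dicts; this layout is an assumption
    and is not the input format of the metric itself.
    """
    buckets = defaultdict(list)
    for rec in records:
        words = rec["question"].split()
        first_word = words[0].lower() if words else ""
        pred = rec["prediction"].strip().lower()
        hit = any(pred == ans.strip().lower() for ans in rec["answers"])
        buckets[first_word].append(hit)
    return {word: 100.0 * sum(hits) / len(hits) for word, hits in buckets.items()}


# Made-up toy records, for illustration only.
records = [
    {"question": "Who sang at the event?", "prediction": "Beyonce", "answers": ["Beyoncé and Bruno Mars"]},
    {"question": "Who wrote the report?", "prediction": "the committee", "answers": ["The committee"]},
    {"question": "When was it founded?", "prediction": "1976", "answers": ["1976"]},
]

breakdown = exact_match_by_question_word(records)
print(breakdown)
```

A large gap between buckets suggests where to look more closely. With only a few questions per bucket the rates are too noisy to act on.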


	## Citation

	@inproceedings{Rajpurkar2016SQuAD10,
	title={SQuAD: 100,000+ Questions for Machine Comprehension of Text},
	author={Pranav Rajpurkar and Jian Zhang and Konstantin Lopyrev and Percy Liang},
	booktitle={EMNLP},
	year={2016}
	}

	## Further References

	- [The Stanford Question Answering Dataset: Background, Challenges, Progress (blog post)](https://rajpurkar.github.io/mlx/qa-and-squad/)
	- [Hugging Face Course -- Question Answering](https://huggingface.co/course/chapter7/7)