---
language:
- en
tags:
- text aggregation
- summarization
license: apache-2.0
datasets:
- toloka/CrowdSpeech
metrics:
- wer
---

# T5 Large for Text Aggregation

## Model description

This is a T5 Large model fine-tuned for the crowdsourced text aggregation task: given several performers' responses to the same task, it produces a single aggregated response. The approach was first introduced at the [VLDB'21 Crowd Science Challenge](https://crowdscience.ai/challenges/vldb21), and the original implementation by the second-place competitor is available on [GitHub](https://github.com/A1exRey/VLDB2021_workshop_t5).

## How to use

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

mname = "toloka/t5-large-for-text-aggregation"
tokenizer = AutoTokenizer.from_pretrained(mname)
model = AutoModelForSeq2SeqLM.from_pretrained(mname)

# Performers' responses to the same task are joined with " | " as the separator
responses = "samplee text | sampl text | sample textt"
input_ids = tokenizer.encode(responses, return_tensors="pt")
outputs = model.generate(input_ids)
decoded = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(decoded)  # sample text
```
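
If several examples need to be aggregated at once, batched generation works as well. The sketch below is illustrative only: the example texts, `num_beams`, and `max_length` are assumptions, not settings from the model card.

```python
# A minimal batched sketch; the inputs and generation settings are assumptions
batch = [
    "the quick brown fox | the quik brown fox | the quick brown focks",
    "jumps over the lazy dog | jumps ovr the lazy dog | jump over the lazy dog",
]
inputs = tokenizer(batch, return_tensors="pt", padding=True, truncation=True)
outputs = model.generate(**inputs, num_beams=4, max_length=64)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
```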

## Training data

Pretrained weights were taken from the [original](https://huggingface.co/t5-large) T5 Large model by Google. For more details on the T5 architecture and training procedure, see https://arxiv.org/abs/1910.10683.

The model was fine-tuned on the `train-clean`, `dev-clean`, and `dev-other` parts of the [CrowdSpeech](https://huggingface.co/datasets/toloka/CrowdSpeech) dataset introduced in [our paper](https://openreview.net/forum?id=3_hgF1NAXU7).
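
For illustration only, the sketch below shows how such data could be converted into (input, target) pairs in the " | "-separated format used above. It is not taken from the model card: the `split` name and the column names (`transcriptions`, `output`) are assumptions and may not match the actual CrowdSpeech schema.

```python
# A hypothetical preprocessing sketch, not taken from the model card
from datasets import load_dataset

crowd = load_dataset("toloka/CrowdSpeech", split="train-clean")  # assumed split name

def to_seq2seq(example):
    # Join the crowd workers' transcriptions with " | " to match the inference format
    example["source"] = " | ".join(example["transcriptions"])  # assumed column name
    example["target"] = example["output"]                      # assumed column name
    return example

crowd = crowd.map(to_seq2seq)
```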

## Training procedure

The model was fine-tuned for eight epochs, directly following the Hugging Face summarization training [example](https://github.com/huggingface/transformers/tree/master/examples/pytorch/summarization).
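
For readers who prefer an in-Python view of that setup, the sketch below outlines an equivalent fine-tuning loop with `Seq2SeqTrainer`. It is not the authors' exact script: every hyperparameter except the eight epochs is an assumption, and `crowd` refers to the hypothetical preprocessed dataset from the sketch above.

```python
# A rough sketch of an equivalent fine-tuning loop, NOT the authors' exact script;
# all hyperparameters except the eight epochs are assumptions.
from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("t5-large")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-large")

def tokenize(batch):
    # "source"/"target" are the columns built in the preprocessing sketch above
    model_inputs = tokenizer(batch["source"], max_length=512, truncation=True)
    labels = tokenizer(text_target=batch["target"], max_length=128, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized = crowd.map(tokenize, batched=True)  # `crowd` comes from the sketch above

args = Seq2SeqTrainingArguments(
    output_dir="t5-large-for-text-aggregation",
    num_train_epochs=8,              # the card states eight epochs
    per_device_train_batch_size=4,   # assumed
    learning_rate=5e-5,              # assumed
)
trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```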

## Eval results

Dataset     | Split      | WER, %
------------|------------|--------
CrowdSpeech | test-clean | 4.99
CrowdSpeech | test-other | 10.61
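
Word error rate on aggregated outputs can be checked with any standard WER implementation; the snippet below uses the `evaluate` library on placeholder sentences, which is only one possible choice and not necessarily the tooling behind the reported numbers.

```python
# Illustrative WER computation; the predictions and references are placeholders
import evaluate

wer = evaluate.load("wer")
predictions = ["sample text", "another aggregated sentence"]
references = ["sample text", "another aggregated sentences"]
print(wer.compute(predictions=predictions, references=references))
```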

### BibTeX entry and citation info

```bibtex
@misc{pavlichenko2021vox,
      title={Vox Populi, Vox DIY: Benchmark Dataset for Crowdsourced Audio Transcription},
      author={Nikita Pavlichenko and Ivan Stelmakh and Dmitry Ustalov},
      year={2021},
      eprint={2107.01091},
      archivePrefix={arXiv},
      primaryClass={cs.SD}
}
```