	---
	language:
	- cs
	- en
	- de
	- fr
	- tr
	- zh
	- es
	- ru
	tags:
	- Summarization
	- abstractive summarization
	- mt5-base
	- Czech
	- text2text generation
	- text generation
	license: cc-by-sa-4.0
	datasets:
	- Multilingual_large_dataset_(multilarge)
	- cnn/dm
	- xsum
	- mlsum
	- cnewsum
	- cnc
	- sumeczech
	metrics:
	- rouge
	- rougeraw
	- MemesCS
	---
	# mt5-base-multilingual-summarization-multilarge-cs
	This model is a fine-tuned checkpoint of [google/mt5-base](https://huggingface.co/google/mt5-base), trained on the Multilingual large summarization dataset, which is focused on Czech texts, to produce multilingual summaries.
	## Task
	The model generates multi-sentence summaries in eight languages. By combining a considerable amount of Czech documents with documents in other languages, we aimed to improve summarization quality in Czech. Supported languages and their language tokens: `'cs': '<extra_id_0>', 'en': '<extra_id_1>', 'de': '<extra_id_2>', 'es': '<extra_id_3>', 'fr': '<extra_id_4>', 'ru': '<extra_id_5>', 'tu': '<extra_id_6>', 'zh': '<extra_id_7>'`
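
As an illustration, the mapping above can be kept in a dictionary and checked before a request is built. This is only a sketch: `LANGUAGE_TOKENS` and `language_token` are hypothetical names, and how the token is actually passed to the model (for example as an input prefix) is not specified in this card.

```python
# Language codes and their tokens, copied from the list above.
LANGUAGE_TOKENS = {
    "cs": "<extra_id_0>", "en": "<extra_id_1>", "de": "<extra_id_2>",
    "es": "<extra_id_3>", "fr": "<extra_id_4>", "ru": "<extra_id_5>",
    "tu": "<extra_id_6>", "zh": "<extra_id_7>",
}


def language_token(lang):
    """Hypothetical helper: return the token for a supported language code."""
    if lang not in LANGUAGE_TOKENS:
        supported = ", ".join(sorted(LANGUAGE_TOKENS))
        raise ValueError(f"Unsupported language {lang!r}; expected one of: {supported}")
    return LANGUAGE_TOKENS[lang]
```

Validating the code up front gives a clear error for an unsupported language instead of a silent fallback.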

	## Usage

	```python
	from collections import OrderedDict

	# NOTE: MultiSummarizer is not defined in this snippet. It has to be
	# imported from the authors' summarization code before running.


	def summ_config():
	    """Build the configuration of the summarization pipeline."""
	    cfg = OrderedDict([
	        # Summarization model checkpoint, one of:
	        #   ctu-aic/m2m100-418M-multilingual-summarization-multilarge-cs
	        #   ctu-aic/mt5-base-multilingual-summarization-multilarge-cs
	        #   ctu-aic/mbart25-multilingual-summarization-multilarge-cs
	        ("model_name", "ctu-aic/mt5-base-multilingual-summarization-multilarge-cs"),

	        # Language of the summarization task.
	        # language : string : cs, en, de, fr, es, tr, ru, zh
	        ("language", "en"),

	        # Parameters of the generation method.
	        ("inference_cfg", OrderedDict([
	            ("num_beams", 4),
	            ("top_k", 40),
	            ("top_p", 0.92),
	            ("do_sample", True),
	            ("temperature", 0.95),
	            ("repetition_penalty", 1.23),
	            ("no_repeat_ngram_size", None),
	            ("early_stopping", True),
	            ("max_length", 128),
	            ("min_length", 10),
	        ])),

	        # Texts to summarize: list of strings, a single string, or a dataset.
	        ("texts", [
	            "english text1 to summarize",
	            "english text2 to summarize",
	        ]),

	        # OPTIONAL: target summaries: list of strings, a single string, or None.
	        ("golds", [
	            "target english text1",
	            "target english text2",
	        ]),
	        # ("golds", None),
	    ])
	    return cfg


	cfg = summ_config()
	mSummarize = MultiSummarizer(**cfg)
	summaries, scores = mSummarize(**cfg)
	```



	## Dataset
	The Multilingual large summarization dataset consists of 10 sub-datasets, mainly based on news articles, including the Daily Mail. The entire training set and 72% of the validation set were used for training.
	```
	Train set: 3 464 563 docs
	Validation set: 121 260 docs
	```
	| Stats | fragment | | | avg document length | | avg summary length | | Documents |
	|-------------|----------|---------------------|--------------------|--------|---------|--------|--------|--------|
	| __dataset__ | __compression__ | __density__ | __coverage__ | __nsent__ | __nwords__ | __nsent__ | __nwords__ | __count__ |
	| cnc | 7.388 | 0.303 | 0.088 | 16.121 | 316.912 | 3.272 | 46.805 | 750K |
	| sumeczech | 11.769 | 0.471 | 0.115 | 27.857 | 415.711 | 2.765 | 38.644 | 1M |
	| cnndm | 13.688 | 2.983 | 0.538 | 32.783 | 676.026 | 4.134 | 54.036 | 300K |
	| xsum | 18.378 | 0.479 | 0.194 | 18.607 | 369.134 | 1.000 | 21.127 | 225K |
	| mlsum/tu | 8.666 | 5.418 | 0.461 | 14.271 | 214.496 | 1.793 | 25.675 | 274K |
	| mlsum/de | 24.741 | 8.235 | 0.469 | 32.544 | 539.653 | 1.951 | 23.077 | 243K |
	| mlsum/fr | 24.388 | 2.688 | 0.424 | 24.533 | 612.080 | 1.320 | 26.93 | 425K |
	| mlsum/es | 36.185 | 3.705 | 0.510 | 31.914 | 746.927 | 1.142 | 21.671 | 291K |
	| mlsum/ru | 78.909 | 1.194 | 0.246 | 62.141 | 948.079 | 1.012 | 11.976 | 27K |
	| cnewsum | 20.183 | 0.000 | 0.000 | 16.834 | 438.271 | 1.109 | 21.926 | 304K |
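
The card does not state how the fragment columns were computed. The sketch below assumes the commonly used extractive-fragment definitions: coverage is the share of summary words that lie inside fragments copied from the document, density is the average squared fragment length per summary word, and compression is the document-to-summary word ratio. Whitespace tokenization and the function names are illustrative assumptions.

```python
def extractive_fragments(article, summary):
    """Greedily collect the longest token sequences shared with the article."""
    fragments = []
    i = 0
    while i < len(summary):
        best = 0  # length of the longest match starting at summary[i]
        j = 0
        while j < len(article):
            if summary[i] == article[j]:
                k = 0
                while (i + k < len(summary) and j + k < len(article)
                       and summary[i + k] == article[j + k]):
                    k += 1
                best = max(best, k)
                j += k
            else:
                j += 1
        if best > 0:
            fragments.append(summary[i:i + best])
        i += max(best, 1)  # skip past the fragment, or past an unmatched word
    return fragments


def fragment_stats(article_text, summary_text):
    """Return (compression, density, coverage) for one document/summary pair."""
    article = article_text.split()  # assumption: whitespace tokenization
    summary = summary_text.split()
    fragments = extractive_fragments(article, summary)
    n = len(summary)
    coverage = sum(len(f) for f in fragments) / n
    density = sum(len(f) ** 2 for f in fragments) / n
    compression = len(article) / n
    return compression, density, coverage
```

Under these definitions, low density and coverage indicate highly abstractive summaries, which is one way to read the rows of the table.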
	### Tokenization
	Truncation and padding were set to 512 tokens for the encoder (input text) and 128 for the decoder (summary).
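
To make the two limits concrete, here is a minimal sketch of truncation and padding to a fixed length on plain lists of token ids. The constants come from the sentence above. The helper name and the default `pad_id=0` are assumptions for illustration. With a Hugging Face tokenizer the same effect is typically obtained by passing `max_length`, `truncation=True` and `padding="max_length"`.

```python
MAX_INPUT_TOKENS = 512    # encoder limit (input text)
MAX_SUMMARY_TOKENS = 128  # decoder limit (summary)


def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return (ids, attention_mask), both exactly max_length long."""
    clipped = list(token_ids[:max_length])       # truncation
    n_pad = max_length - len(clipped)
    ids = clipped + [pad_id] * n_pad             # right padding
    mask = [1] * len(clipped) + [0] * n_pad      # 1 = real token, 0 = padding
    return ids, mask


# Hypothetical token ids standing in for a tokenized document and summary.
input_ids, input_mask = pad_or_truncate(range(1, 701), MAX_INPUT_TOKENS)
label_ids, label_mask = pad_or_truncate(range(1, 41), MAX_SUMMARY_TOKENS)
```

In this example the 700-token document is cut to 512 tokens, while the 40-token summary is padded up to 128.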
	## Training
	The model was trained with a cross-entropy loss.
	```
	Time: 3 days 20 hours
	Epochs: 10 of 10 (1080K steps)
	GPUs: 4x NVIDIA A100-SXM4-40GB
	eloss: 2.462 - 1.797
	tloss: 17.322 - 1.578
	```
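
For reference, token-level cross-entropy is the average negative log-probability that the model assigns to each target token. The sketch below computes it from raw logits in plain Python. Skipping padded positions and the value `pad_id=0` are assumptions for illustration, not details confirmed by this card.

```python
import math


def token_cross_entropy(logits, target_ids, pad_id=0):
    """Mean negative log-likelihood of the target tokens, ignoring padding."""
    total, count = 0.0, 0
    for step_logits, target in zip(logits, target_ids):
        if target == pad_id:
            continue  # assumption: padded positions do not contribute
        # Numerically stable log-sum-exp over the vocabulary.
        peak = max(step_logits)
        log_norm = peak + math.log(sum(math.exp(x - peak) for x in step_logits))
        total += log_norm - step_logits[target]
        count += 1
    return total / count if count else 0.0
```

A model that spreads probability uniformly over a vocabulary of size V scores log(V), and the loss approaches 0 as the correct token dominates.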
	### ROUGE results per individual dataset test set:

	| ROUGE | ROUGE-1 | | | ROUGE-2 | | | ROUGE-L | | |
	|-----------|---------|---------|-----------|--------|--------|-----------|--------|--------|---------|
	| | Precision | Recall | Fscore | Precision | Recall | Fscore | Precision | Recall | Fscore |
	| cnc | 30.62 | 19.83 | 23.44 | 9.94 | 6.52 | 7.67 | 22.92 | 14.92 | 17.6 |
	| sumeczech | 27.57 | 17.6 | 20.85 | 8.12 | 5.23 | 6.17 | 20.84 | 13.38 | 15.81 |
	| cnndm | 43.83 | 37.73 | 39.34 | 20.81 | 17.82 | 18.6 | 31.8 | 27.42 | 28.55 |
	| xsum | 41.63 | 30.54 | 34.56 | 16.13 | 11.76 | 13.33 | 33.65 | 24.74 | 27.97 |
	| mlsum-tu | 54.4 | 43.29 | 46.2 | 38.78 | 31.31 | 33.23 | 48.18 | 38.44 | 41 |
	| mlsum-de | 47.94 | 44.14 | 45.11 | 36.42 | 35.24 | 35.42 | 44.43 | 41.42 | 42.16 |
	| mlsum-fr | 35.26 | 25.96 | 28.98 | 16.72 | 12.35 | 13.75 | 28.06 | 20.75 | 23.12 |
	| mlsum-es | 33.37 | 24.84 | 27.52 | 13.29 | 10.05 | 11.05 | 27.63 | 20.69 | 22.87 |
	| mlsum-ru | 0.79 | 0.66 | 0.66 | 0.26 | 0.2 | 0.22 | 0.79 | 0.66 | 0.65 |
	| cnewsum | 24.49 | 24.38 | 23.23 | 6.48 | 6.7 | 6.24 | 24.18 | 24.04 | 22.91 |
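
As a reminder of what the Precision, Recall and Fscore columns mean, the following is a simplified ROUGE-1 sketch based on clipped unigram overlap. It assumes lowercased whitespace tokens and is not the scorer used to produce the table. The function returns fractions, whereas the table values appear to be percentages.

```python
from collections import Counter


def rouge_1(candidate, reference):
    """Return (precision, recall, fscore) from clipped unigram overlap."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # each word counted at most min(cand, ref) times
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    if precision + recall == 0:
        return 0.0, 0.0, 0.0
    fscore = 2 * precision * recall / (precision + recall)
    return precision, recall, fscore
```

Precision is measured against the length of the generated summary and recall against the length of the reference. A higher precision than recall, as in most rows above, suggests summaries that are shorter than their references.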
