	---
	language:
	- en
	- de
	- es
	- ar
	- ja
	- ko
	- zh
	license: cc-by-nc-sa-4.0
	library_name: transformers
	datasets:
	- wi_locness
	- matejklemen/falko_merlin
	- paws
	- paws-x
	- facebook/asset
	metrics:
	- bleu
	- rouge
	- sari
	- accuracy
	---

	# Model Card for mEdIT-xxl

	The `medit-xxl` model was obtained by fine-tuning the `MBZUAI/bactrian-x-llama-13b-lora` model on the mEdIT dataset.

	Paper: mEdIT: Multilingual Text Editing via Instruction Tuning

	Authors: Vipul Raheja, Dimitris Alikaniotis, Vivek Kulkarni, Bashar Alhafni, Dhruv Kumar

	## Model Details

	### Model Description

	- Language(s) (NLP): Arabic, Chinese, English, German, Japanese, Korean, Spanish
	- Finetuned from model: `MBZUAI/bactrian-x-llama-13b-lora`

	### Model Sources

	- Repository: https://github.com/vipulraheja/medit
	- Paper: https://arxiv.org/abs/2402.16472v1

	## How to use

	Given an edit instruction and an original text, our model can generate the edited version of the text.

	![task_specs](https://cdn-uploads.huggingface.co/production/uploads/60985a0547dc3dbf8a976607/816ZY2t0XPCpMMd6Z072K.png)

	Specifically, our models support both multilingual and cross-lingual text revision. The input and output texts are always in the same language; the setting is monolingual when the edit instruction is written in the language of the input text, and cross-lingual when it is written in a different one.

	### Instruction format

	Follow the instruction format below exactly; deviating from it may degrade the model's output.

	```
	instruction_tokens = [
	"Instruction",
	"Anweisung",
	...
	]

	input_tokens = [
	"Input",
	"Aporte",
	...
	]

	output_tokens = [
	"Output",
	"Produzione",
	...
	]

	task_descriptions = [
	"Fix grammatical errors in this sentence", # <-- GEC task
	"Umschreiben Sie den Satz", # <-- Paraphrasing
	...
	]
	```

	The full list of instruction tokens, input/output tokens, and task descriptions can be found in the Appendix of our paper.

	```
	prompt_template = """### <instruction_token>:\n<task_description>\n### <input_token>:\n<input>\n### <output_token>:\n\n"""
	```

	Note that the tokens and the task description need not be in the language of the input (in the case of cross-lingual revision).
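The template above can be filled in with a small helper. The following is a minimal sketch, not part of the released code: `MARKERS` and `build_prompt` are hypothetical names, and the table holds only the English and Japanese tokens that appear in this README (assuming the first entry of each token list is the English one). Any other language would need its tokens taken from the paper's appendix.

```python
# Hypothetical helper: marker triples (instruction, input, output) copied from
# the examples in this README. The full list is in the paper's appendix.
MARKERS = {
    "en": ("Instruction", "Input", "Output"),
    "ja": ("命令", "入力", "出力"),
}


def build_prompt(task_description: str, text: str, marker_lang: str = "en") -> str:
    """Fill the prompt template with the markers of `marker_lang`."""
    instruction_token, input_token, output_token = MARKERS[marker_lang]
    return (
        f"### {instruction_token}:\n{task_description}\n"
        f"### {input_token}:\n{text}\n"
        f"### {output_token}:\n\n"
    )


# Cross-lingual example: Japanese instruction, English input text.
prompt = build_prompt("文章を文法的にする", "I has small cat ,", marker_lang="ja")
print(repr(prompt))
```

Keeping the markers and the task description as separate arguments makes it easy to switch between the monolingual and cross-lingual settings without touching the input text.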


	### Run the model

	```python
	from transformers import AutoTokenizer, AutoModelForCausalLM

	model_id = "grammarly/medit-xxl"
	tokenizer = AutoTokenizer.from_pretrained(model_id)

	model = AutoModelForCausalLM.from_pretrained(model_id)

	# English GEC using Japanese instructions
	prompt = '### 命令:\n文章を文法的にする\n### 入力:\nI has small cat ,\n### 出力:\n\n'

	inputs = tokenizer(prompt, return_tensors='pt')

	outputs = model.generate(**inputs, max_new_tokens=20)

	print(tokenizer.decode(outputs[0], skip_special_tokens=True))

	# --> I have a small cat ,

	# German GEC using Japanese instructions
	prompt = '### 命令:\n文章を文法的にする\n### 入力:\nIch haben eines kleines Katze ,\n### 出力:\n\n'

	# ...
	# --> Ich habe eine kleine Katze ,
	```
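For a causal language model, the sequence returned by `generate` usually contains the prompt followed by the newly generated tokens, so the decoded string may include the prompt as well as the edit. The sketch below shows one way to keep only the edited text, assuming the decoded string still contains the output marker exactly as written in the prompt. `extract_edit` is a hypothetical helper, and the `decoded` string is a hand-written stand-in built from the example above, not real model output.

```python
def extract_edit(decoded: str, output_token: str) -> str:
    """Return the text after the last '### <output_token>:' marker.

    Falls back to the whole (stripped) string if the marker is not found.
    """
    marker = f"### {output_token}:"
    _, found, tail = decoded.rpartition(marker)
    if not found:
        return decoded.strip()
    return tail.strip()


# Hand-written stand-in for a decoded generation (prompt + edited sentence).
decoded = (
    "### 命令:\n文章を文法的にする\n"
    "### 入力:\nI has small cat ,\n"
    "### 出力:\n\nI have a small cat ,"
)
print(extract_edit(decoded, "出力"))
# prints: I have a small cat ,
```

If the tokenizer does not reproduce the prompt text exactly, slicing the generated token IDs by the prompt length before decoding is a common alternative.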

	#### Software
	https://github.com/vipulraheja/medit

	## Citation

	BibTeX:
	```
	@article{raheja2023medit,
	title={mEdIT: Multilingual Text Editing via Instruction Tuning},
	author={Vipul Raheja and Dimitris Alikaniotis and Vivek Kulkarni and Bashar Alhafni and Dhruv Kumar},
	year={2024},
	eprint={2402.16472v1},
	archivePrefix={arXiv},
	primaryClass={cs.CL}
	}
	```

	APA:
	Raheja, V., Alikaniotis, D., Kulkarni, V., Alhafni, B., & Kumar, D. (2024). mEdIT: Multilingual Text Editing via Instruction Tuning. arXiv. https://arxiv.org/abs/2402.16472