---
language:
- en
- de
- es
- ar
- ja
- ko
- zh
license: cc-by-nc-sa-4.0
library_name: transformers
datasets:
- wi_locness
- matejklemen/falko_merlin
- paws
- paws-x
- facebook/asset
metrics:
- bleu
- rouge
- sari
- accuracy
---
# Model Card for mEdIT-xxl
The `medit-xxl` model was obtained by fine-tuning the `MBZUAI/bactrian-x-llama-13b-lora` model on the mEdIT dataset.
**Paper:** mEdIT: Multilingual Text Editing via Instruction Tuning
**Authors:** Vipul Raheja, Dimitris Alikaniotis, Vivek Kulkarni, Bashar Alhafni, Dhruv Kumar
## Model Details
### Model Description
- **Language(s) (NLP)**: Arabic, Chinese, English, German, Japanese, Korean, Spanish
- **Finetuned from model:** `MBZUAI/bactrian-x-llama-13b-lora`
### Model Sources
- **Repository:** https://github.com/vipulraheja/medit
- **Paper:** https://arxiv.org/abs/2402.16472v1
## How to use
Given an edit instruction and an original text, our model can generate the edited version of the text.<br>
![task_specs](https://cdn-uploads.huggingface.co/production/uploads/60985a0547dc3dbf8a976607/816ZY2t0XPCpMMd6Z072K.png)
Specifically, our models support both multilingual and cross-lingual text revision. The input and output texts are always in the same language.
The setting is monolingual when the edit instruction is written in the same language as the input text, and cross-lingual when it is written in a different language.
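As a minimal sketch of that distinction, the check reduces to comparing two language codes. The function name `revision_setting` and the use of two-letter codes such as `"en"` or `"ja"` are assumptions made for illustration; they are not part of the model or its repository.

```python
def revision_setting(instruction_lang: str, input_lang: str) -> str:
    """Return "monolingual" when both language codes match, else "cross-lingual".

    Hypothetical helper: the language codes are supplied by the caller,
    nothing here detects the language of a text.
    """
    same = instruction_lang.strip().lower() == input_lang.strip().lower()
    return "monolingual" if same else "cross-lingual"


# Japanese instruction, English input text
print(revision_setting("ja", "en"))  # cross-lingual
# German instruction, German input text
print(revision_setting("de", "de"))  # monolingual
```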
### Instruction format
Follow the instruction format below exactly; deviating from it may degrade the quality of the model's output.
```
instruction_tokens = [
"Instruction",
"Anweisung",
...
]
input_tokens = [
"Input",
"Aporte",
...
]
output_tokens = [
"Output",
"Produzione",
...
]
task_descriptions = [
"Fix grammatical errors in this sentence", # <-- GEC task
"Umschreiben Sie den Satz", # <-- Paraphrasing
...
]
```
**The entire list of possible instructions, input/output tokens, and task descriptions can be found in the Appendix of our paper.**
```
prompt_template = """### <instruction_token>:\n<task_description>\n### <input_token>:\n<input>\n### <output_token>:\n\n"""
```
Note that the tokens and the task description need not be in the language of the input (in the case of cross-lingual revision).
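The template can be filled in with ordinary string formatting. The sketch below is one way to do it; `build_prompt` and its parameter names are hypothetical, and the English tokens are used only as defaults. Any other token or task description from the paper's Appendix can be passed in instead.

```python
# Same layout as prompt_template above, written with named placeholders.
PROMPT_TEMPLATE = (
    "### {instruction_token}:\n{task_description}\n"
    "### {input_token}:\n{text}\n"
    "### {output_token}:\n\n"
)


def build_prompt(
    task_description: str,
    text: str,
    instruction_token: str = "Instruction",
    input_token: str = "Input",
    output_token: str = "Output",
) -> str:
    """Assemble a prompt string (hypothetical helper, not part of the mEdIT repo)."""
    return PROMPT_TEMPLATE.format(
        instruction_token=instruction_token,
        task_description=task_description,
        input_token=input_token,
        text=text,
        output_token=output_token,
    )


prompt = build_prompt("Fix grammatical errors in this sentence", "I has small cat ,")
print(repr(prompt))
```

For cross-lingual revision, only the tokens and the task description change; the `text` argument stays in the language to be edited.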
### Run the model
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
model_id = "grammarly/medit-xxl"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)
# English GEC using Japanese instructions
prompt = '### 命令:\n文章を文法的にする\n### 入力:\nI has small cat ,\n### 出力:\n\n'
inputs = tokenizer(prompt, return_tensors='pt')
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
# --> I have a small cat ,
# German GEC using Japanese instructions
prompt = '### 命令:\n文章を文法的にする\n### 入力:\nIch haben eines kleines Katze ,\n### 出力:\n\n'
# ...
# --> Ich habe eine kleine Katze ,
```
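For decoder-only models, `generate` returns the prompt tokens followed by the newly generated ones, so the decoded string usually starts with the prompt itself. The sketch below shows one way to keep only the edited text by splitting on the output marker. `extract_edit` is a hypothetical helper, and it assumes the model emits nothing but the edit after the final `### <output_token>:` line.

```python
def extract_edit(decoded: str, output_token: str = "Output") -> str:
    """Return the text after the last "### <output_token>:" marker.

    Hypothetical post-processing helper. If the marker is absent,
    the whole string is returned with surrounding whitespace stripped.
    """
    marker = f"### {output_token}:"
    _, found, tail = decoded.rpartition(marker)
    if not found:
        return decoded.strip()
    return tail.strip()


# A hand-written stand-in for tokenizer.decode(...) output, not a real model run.
decoded = (
    "### Instruction:\nFix grammatical errors in this sentence\n"
    "### Input:\nI has small cat ,\n"
    "### Output:\n\nI have a small cat ,"
)
print(extract_edit(decoded))  # I have a small cat ,
```

Pass the same output token that was used in the prompt (for example `"出力"` for the Japanese prompts above) so that the marker matches.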
#### Software
https://github.com/vipulraheja/medit
## Citation
**BibTeX:**
```
@article{raheja2023medit,
title={mEdIT: Multilingual Text Editing via Instruction Tuning},
author={Vipul Raheja and Dimitris Alikaniotis and Vivek Kulkarni and Bashar Alhafni and Dhruv Kumar},
year={2024},
eprint={2402.16472v1},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
```
**APA:**
Raheja, V., Alikaniotis, D., Kulkarni, V., Alhafni, B., & Kumar, D. (2024). mEdIT: Multilingual Text Editing via Instruction Tuning. arXiv. https://arxiv.org/abs/2402.16472