	---
	license: apache-2.0
	language:
	- multilingual
	- en
	- ru
	- es
	- fr
	- de
	- it
	- pt
	- pl
	- nl
	- vi
	- tr
	- sv
	- id
	- ro
	- cs
	- zh
	- hu
	- ja
	- th
	- fi
	- fa
	- uk
	- da
	- el
	- 'no'
	- bg
	- sk
	- ko
	- ar
	- lt
	- ca
	- sl
	- he
	- et
	- lv
	- hi
	- sq
	- ms
	- az
	- sr
	- ta
	- hr
	- kk
	- is
	- ml
	- mr
	- te
	- af
	- gl
	- fil
	- be
	- mk
	- eu
	- bn
	- ka
	- mn
	- bs
	- uz
	- ur
	- sw
	- yue
	- ne
	- kn
	- kaa
	- gu
	- si
	- cy
	- eo
	- la
	- hy
	- ky
	- tg
	- ga
	- mt
	- my
	- km
	- tt
	- so
	- ku
	- ps
	- pa
	- rw
	- lo
	- ha
	- dv
	- fy
	- lb
	- ckb
	- mg
	- gd
	- am
	- ug
	- ht
	- grc
	- hmn
	- sd
	- jv
	- mi
	- tk
	- ceb
	- yi
	- ba
	- fo
	- or
	- xh
	- su
	- kl
	- ny
	- sm
	- sn
	- co
	- zu
	- ig
	- yo
	- pap
	- st
	- haw
	- as
	- oc
	- cv
	- lus
	- tet
	- gsw
	- sah
	- br
	- rm
	- sa
	- bo
	- om
	- se
	- ce
	- cnh
	- ilo
	- hil
	- udm
	- os
	- lg
	- ti
	- vec
	- ts
	- tyv
	- kbd
	- ee
	- iba
	- av
	- kha
	- to
	- tn
	- nso
	- fj
	- zza
	- ak
	- ada
	- otq
	- dz
	- bua
	- cfm
	- ln
	- chm
	- gn
	- krc
	- wa
	- hif
	- yua
	- srn
	- war
	- rom
	- bik
	- pam
	- sg
	- lu
	- ady
	- kbp
	- syr
	- ltg
	- myv
	- iso
	- kac
	- bho
	- ay
	- kum
	- qu
	- za
	- pag
	- ngu
	- ve
	- pck
	- zap
	- tyz
	- hui
	- bbc
	- tzo
	- tiv
	- ksd
	- gom
	- min
	- ang
	- nhe
	- bgp
	- nzi
	- nnb
	- nv
	- zxx
	- bci
	- kv
	- new
	- mps
	- alt
	- meu
	- bew
	- fon
	- iu
	- abt
	- mgh
	- mnw
	- tvl
	- dov
	- tlh
	- ho
	- kw
	- mrj
	- meo
	- crh
	- mbt
	- emp
	- ace
	- ium
	- mam
	- gym
	- mai
	- crs
	- pon
	- ubu
	- fip
	- quc
	- gv
	- kj
	- btx
	- ape
	- chk
	- rcf
	- shn
	- tzh
	- mdf
	- ppk
	- ss
	- gag
	- cab
	- kri
	- seh
	- ibb
	- tbz
	- bru
	- enq
	- ach
	- cuk
	- kmb
	- wo
	- kek
	- qub
	- tab
	- bts
	- kos
	- rwo
	- cak
	- tuc
	- bum
	- cjk
	- gil
	- stq
	- tsg
	- quh
	- mak
	- arn
	- ban
	- jiv
	- sja
	- yap
	- tcy
	- toj
	- twu
	- xal
	- amu
	- rmc
	- hus
	- nia
	- kjh
	- bm
	- guh
	- mas
	- acf
	- dtp
	- ksw
	- bzj
	- din
	- zne
	- mad
	- msi
	- mag
	- mkn
	- kg
	- lhu
	- ch
	- qvi
	- mh
	- djk
	- sus
	- mfe
	- srm
	- dyu
	- ctu
	- gui
	- pau
	- inb
	- bi
	- mni
	- guc
	- jam
	- wal
	- jac
	- bas
	- gor
	- skr
	- nyu
	- noa
	- sda
	- gub
	- nog
	- cni
	- teo
	- tdx
	- sxn
	- rki
	- nr
	- frp
	- alz
	- taj
	- lrc
	- cce
	- rn
	- jvn
	- hvn
	- nij
	- dwr
	- izz
	- msm
	- bus
	- ktu
	- chr
	- maz
	- tzj
	- suz
	- knj
	- bim
	- gvl
	- bqc
	- tca
	- pis
	- prk
	- laj
	- mel
	- qxr
	- niq
	- ahk
	- shp
	- hne
	- spp
	- koi
	- krj
	- quf
	- luz
	- agr
	- tsc
	- mqy
	- gof
	- gbm
	- miq
	- dje
	- awa
	- bjj
	- qvz
	- sjp
	- tll
	- raj
	- kjg
	- bgz
	- quy
	- cbk
	- akb
	- oj
	- ify
	- mey
	- ks
	- cac
	- brx
	- qup
	- syl
	- jax
	- ff
	- ber
	- tks
	- trp
	- mrw
	- adh
	- smt
	- srr
	- ffm
	- qvc
	- mtr
	- ann
	- aa
	- noe
	- nut
	- gyn
	- kwi
	- xmm
	- msb
	library_name: transformers
	tags:
	- text2text-generation
	- text-generation-inference
	datasets:
	- allenai/MADLAD-400
	pipeline_tag: translation
	metrics:
	- bleu
	---

	# Model Card for MADLAD-400-7B-CT2-int8

	# Table of Contents

	0. [TL;DR](#tldr)
	1. [Model Details](#model-details)
	2. [Usage](#usage)
	3. [Uses](#uses)
	4. [Bias, Risks, and Limitations](#bias-risks-and-limitations)
	5. [Training Details](#training-details)
	6. [Evaluation](#evaluation)
	7. [Environmental Impact](#environmental-impact)
	8. [Citation](#citation)

	# TL;DR

	MADLAD-400-7B-MT is a multilingual machine translation model based on the T5 architecture that was
	trained on 1 trillion tokens covering over 450 languages using publicly available data.
	It is competitive with models that are significantly larger.

	Disclaimer: [Heng-Shiou Sheu](https://huggingface.co/Heng666), who was not involved in this research, converted
	the original model to a CTranslate2-optimized model and wrote this model card based on [google/madlad400-7b-mt](https://huggingface.co/google/madlad400-7b-mt).

	# Model Details

	## Model Description

	- Model type: Language model
	- Language(s) (NLP): Multilingual (400+ languages)
	- License: Apache 2.0
	- Related Models: [All MADLAD-400 Checkpoints](https://huggingface.co/models?search=madlad)
	- Original Checkpoints: [All Original MADLAD-400 Checkpoints](https://github.com/google-research/google-research/tree/master/madlad_400)
	- Resources for more information:
	  - [Research paper](https://arxiv.org/abs/2309.04662)
	  - [GitHub Repo](https://github.com/google-research/t5x)
	  - [Hugging Face MADLAD-400 Docs (Similar to T5)](https://huggingface.co/docs/transformers/model_doc/MADLAD-400) - [Pending PR](https://github.com/huggingface/transformers/pull/27471)

	# Usage

	The example below shows how to use the model:

	## Running the model on a CPU or GPU

	First, install the packages that the example requires:

	`pip install ctranslate2 sentencepiece huggingface_hub`

	```python
	import ctranslate2
	from sentencepiece import SentencePieceProcessor
	from huggingface_hub import snapshot_download

	# Download the converted CTranslate2 model files from the Hugging Face Hub.
	model_name = "Heng666/madlad400-7b-ct2-int8"
	model_path = snapshot_download(model_name)

	# Load the SentencePiece tokenizer and the CTranslate2 translator.
	tokenizer = SentencePieceProcessor()
	tokenizer.load(f"{model_path}/sentencepiece.model")
	translator = ctranslate2.Translator(model_path)

	# The target language is selected with a <2xx> tag; "pt" is Portuguese.
	target_language = "pt"
	input_text = "I love pizza!"
	input_tokens = tokenizer.encode(f"<2{target_language}> {input_text}", out_type=str)
	results = translator.translate_batch(
	    [input_tokens],
	    batch_type="tokens",
	    max_batch_size=1024,
	    beam_size=1,
	    no_repeat_ngram_size=1,
	    repetition_penalty=2,
	)
	# Decode the best (first) hypothesis back into text.
	translated_sentence = tokenizer.decode(results[0].hypotheses[0])
	print(translated_sentence)
	# Eu adoro pizza!
	```
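
To translate several sentences in one call, the same steps can be wrapped in a small helper. The sketch below is illustrative only: `translate_many` is a hypothetical name, and it assumes `translator` behaves like the `ctranslate2.Translator` and `tokenizer` like the `SentencePieceProcessor` created in the example above. Both are passed in as arguments, so the helper itself loads no model.

```python
from typing import List, Sequence


def translate_many(translator, tokenizer, texts: Sequence[str], target_language: str) -> List[str]:
    """Translate several sentences into one target language.

    Hypothetical helper: `translator` is assumed to expose `translate_batch`
    like ctranslate2.Translator, and `tokenizer` to expose `encode`/`decode`
    like sentencepiece.SentencePieceProcessor.
    """
    # Prepend the <2xx> target-language tag to each sentence, then split into pieces.
    batch = [
        tokenizer.encode(f"<2{target_language}> {text}", out_type=str)
        for text in texts
    ]
    results = translator.translate_batch(
        batch,
        batch_type="tokens",
        max_batch_size=1024,
        beam_size=1,
    )
    # Each result carries a list of hypotheses; keep the first one.
    return [tokenizer.decode(result.hypotheses[0]) for result in results]
```

With the objects from the example above, a call might look like `translate_many(translator, tokenizer, ["I love pizza!", "Good morning."], "pt")`.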


	# Uses

	## Direct Use and Downstream Use

	> Primary intended uses: Machine Translation and multilingual NLP tasks on over 400 languages.
	> Primary intended users: Research community.

	## Out-of-Scope Use

	> These models are trained on general domain data and are therefore not meant to
	> work on domain-specific models out-of-the box. Moreover, these research models have not been assessed
	> for production usecases.

	# Bias, Risks, and Limitations

	> We note that we evaluate on only 204 of the languages supported by these models and on machine translation
	> and few-shot machine translation tasks. Users must consider use of this model carefully for their own
	> usecase.

	## Ethical considerations and risks

	> We trained these models with MADLAD-400 and publicly available data to create baseline models that
	> support NLP for over 400 languages, with a focus on languages underrepresented in large-scale corpora.
	> Given that these models were trained with web-crawled datasets that may contain sensitive, offensive or
	> otherwise low-quality content despite extensive preprocessing, it is still possible that these issues to the
	> underlying training data may cause differences in model performance and toxic (or otherwise problematic)
	> output for certain domains. Moreover, large models are dual use technologies that have specific risks
	> associated with their use and development. We point the reader to surveys such as those written by
	> Weidinger et al. or Bommasani et al. for a more detailed discussion of these risks, and to Liebling
	> et al. for a thorough discussion of the risks of machine translation systems.

	## Known Limitations

	More information needed

	## Sensitive Use

	More information needed

	# Training Details

	> We train models of various sizes: a 3B, 32-layer parameter model,
	> a 7.2B 48-layer parameter model and a 10.7B 32-layer parameter model.
	> We share all parameters of the model across language pairs,
	> and use a Sentence Piece Model with 256k tokens shared on both the encoder and decoder
	> side. Each input sentence has a `<2xx>` token prepended to the source sentence to indicate the target
	> language.

	See the [research paper](https://arxiv.org/pdf/2309.04662.pdf) for further details.
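
The target-language tag convention described above can be sketched in a few lines. This is a minimal illustration, not the authors' preprocessing code: `add_target_tag` is a hypothetical helper, and the code pattern it checks (two or three lowercase letters, optionally followed by a script suffix such as `_Latn`) is an assumption based on the language codes listed in this card, not the model's actual vocabulary.

```python
import re

# Assumption: language codes look like "pt", "fil" or "xx_Latn".
# This pattern is illustrative and is not taken from the model's vocabulary.
_LANG_CODE = re.compile(r"[a-z]{2,3}(_[A-Z][a-z]{3})?")


def add_target_tag(text: str, target_language: str) -> str:
    """Prepend the <2xx> tag that tells the model which language to produce."""
    if not _LANG_CODE.fullmatch(target_language):
        raise ValueError(f"unexpected language code: {target_language!r}")
    return f"<2{target_language}> {text.strip()}"


# The same source sentence is tagged differently for each target language.
tagged = [add_target_tag("I love pizza!", lang) for lang in ("pt", "fr", "de")]
print(tagged)
```

The tagged string is what gets passed to the tokenizer; the source language is not marked.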

	## Training Data

	> For both the machine translation and language model, MADLAD-400 is used. For the machine translation
	> model, a combination of parallel datasources covering 157 languages is also used. Further details are
	> described in the [paper](https://arxiv.org/pdf/2309.04662.pdf).

	## Training Procedure

	See the [research paper](https://arxiv.org/pdf/2309.04662.pdf) for further details.

	# Evaluation

	## Testing Data, Factors & Metrics

	> For evaluation, we used WMT, NTREX, Flores-200 and Gatones datasets as described in Section 4.3 in the [paper](https://arxiv.org/pdf/2309.04662.pdf).

	> The translation quality of this model varies based on language, as seen in the paper, and likely varies on
	> domain, though we have not assessed this.

	## Results

	![image/png](https://cdn-uploads.huggingface.co/production/uploads/64b7f632037d6452a321fa15/EzsMD1AwCuFH0S0DeD-n8.png)

	![image/png](https://cdn-uploads.huggingface.co/production/uploads/64b7f632037d6452a321fa15/CJ5zCUVy7vTU76Lc8NZcK.png)

	![image/png](https://cdn-uploads.huggingface.co/production/uploads/64b7f632037d6452a321fa15/NK0S-yVeWuhKoidpLYh3m.png)

	See the [research paper](https://arxiv.org/pdf/2309.04662.pdf) for further details.

	# Environmental Impact

	More information needed

	# Citation

	BibTeX:

	```bibtex
	@misc{kudugunta2023madlad400,
	title={MADLAD-400: A Multilingual And Document-Level Large Audited Dataset},
	author={Sneha Kudugunta and Isaac Caswell and Biao Zhang and Xavier Garcia and Christopher A. Choquette-Choo and Katherine Lee and Derrick Xin and Aditya Kusupati and Romi Stella and Ankur Bapna and Orhan Firat},
	year={2023},
	eprint={2309.04662},
	archivePrefix={arXiv},
	primaryClass={cs.CL}
	}
	```