Update README.md

dd61435 verified 6 months ago

7.12 kB

	---
	license: mit
	language:
	- fr
	library_name: transformers
	tags:
	- linformer
	- legal
	- RoBERTa
	- pytorch
	---

	# Jargon-general-legal

	[Jargon](https://hal.science/hal-04535557/file/FB2_domaines_specialises_LREC_COLING24.pdf) is an efficient transformer encoder LM for French, combining the LinFormer attention mechanism with the RoBERTa model architecture.

	Jargon is available in several versions with different context sizes and types of pre-training corpora.

	<!-- Provide a quick summary of what the model is/does. -->

	<!-- This modelcard aims to be a base template for new models. It has been generated using [this raw template](https://github.com/huggingface/huggingface_hub/blob/main/src/huggingface_hub/templates/modelcard_template.md?plain=1).
	-->

	\| Model \| Initialised from... \|Training Data\|
	\|-------------------------------------------------------------------------------------\|:-----------------------:\|:----------------:\|
	\| [jargon-general-base](https://huggingface.co/PantagrueLLM/jargon-general-base) \| scratch \|8.5GB Web Corpus\|
	\| [jargon-general-biomed](https://huggingface.co/PantagrueLLM/jargon-general-biomed) \| jargon-general-base \|5.4GB Medical Corpus\|
	\| [jargon-general-legal](https://huggingface.co/PantagrueLLM/jargon-general-legal) (this model) \| jargon-general-base \|18GB Legal Corpus
	\| [jargon-multidomain-base](https://huggingface.co/PantagrueLLM/jargon-multidomain-base) \| jargon-general-base \|Medical+Legal Corpora\|
	\| [jargon-legal](https://huggingface.co/PantagrueLLM/jargon-legal) \| scratch \|18GB Legal Corpus\|
	\| [jargon-legal-4096](https://huggingface.co/PantagrueLLM/jargon-legal-4096) \| scratch \|18GB Legal Corpus\|
	\| [jargon-biomed](https://huggingface.co/PantagrueLLM/jargon-biomed) \| scratch \|5.4GB Medical Corpus\|
	\| [jargon-biomed-4096](https://huggingface.co/PantagrueLLM/jargon-biomed-4096) \| scratch \|5.4GB Medical Corpus\|
	\| [jargon-NACHOS](https://huggingface.co/PantagrueLLM/jargon-NACHOS) \| scratch \|[NACHOS](https://drbert.univ-avignon.fr/)\|
	\| [jargon-NACHOS-4096](https://huggingface.co/PantagrueLLM/jargon-NACHOS-4096) \| scratch \|[NACHOS](https://drbert.univ-avignon.fr/)\|


	## Evaluation

	The Jargon models were evaluated on an range of specialized downstream tasks.

	#### Legal Domain Benchmark

	Results averaged across five funs with varying random seeds.

	\| \|[ECtHR-FR](https://huggingface.co/datasets/audibeal/fr-echr)\|[OACS](https://www.jeuxdemots.org/OACS/oacs.php)\|[SJP](https://aclanthology.org/2021.nllp-1.3/)\|
	\|-------------------------\|:-----------------------:\|:-----------------------:\|:-----------------------:\|
	\| Task Type \| Document Classification \| Document Classification \| Document Classification \|
	\| Metric \| Macro-F1 \| Macro-F1 \| Macro-F1 \|
	\| jargon-general-base \| 42.9 \| 50.8 \| 55.1 \|
	\| jargon-multidomain-base \| 44.5 \| 55.6 \| 58.1 \|
	\| jargon-general-legal \| 43.1 \| 49.9 \| 54.5 \|
	\| jargon-legal \| 44.6 \| 51.6 \| 56.7 \|
	\| jargon-legal-4096 \| 45.9 \| 54.1 \| 68.2 \|

	For more info please check out the [paper](https://hal.science/hal-04535557/file/FB2_domaines_specialises_LREC_COLING24.pdf), accepted for publication at [LREC-COLING 2024](https://lrec-coling-2024.org/list-of-accepted-papers/).


	## Using Jargon models with HuggingFace transformers

	You can get started with this model using the code snippet below:

	```python
	from transformers import AutoModelForMaskedLM, AutoTokenizer, pipeline

	tokenizer = AutoTokenizer.from_pretrained("PantagrueLLM/jargon-general-legal", trust_remote_code=True)
	model = AutoModelForMaskedLM.from_pretrained("PantagrueLLM/jargon-general-legal", trust_remote_code=True)

	jargon_maskfiller = pipeline("fill-mask", model=model, tokenizer=tokenizer)
	output = jargon_maskfiller("Il est allé au <mask> hier")
	```

	You can also use the classes `AutoModel`, `AutoModelForSequenceClassification`, or `AutoModelForTokenClassification` to load Jargon models, depending on the downstream task in question.

	- Language(s): French
	- License: MIT
	- Developed by: Vincent Segonne
	- Funded by
	- GENCI-IDRIS (Grant 2022 A0131013801)
	- French National Research Agency: Pantagruel grant ANR-23-IAS1-0001
	- MIAI@Grenoble Alpes ANR-19-P3IA-0003
	- PROPICTO ANR-20-CE93-0005
	- Lawbot ANR-20-CE38-0013
	- Swiss National Science Foundation (grant PROPICTO N°197864)
	- Authors
	- Vincent Segonne
	- Aidan Mannion
	- Laura Cristina Alonzo Canul
	- Alexandre Audibert
	- Xingyu Liu
	- Cécile Macaire
	- Adrien Pupier
	- Yongxin Zhou
	- Mathilde Aguiar
	- Felix Herron
	- Magali Norré
	- Massih-Reza Amini
	- Pierrette Bouillon
	- Iris Eshkol-Taravella
	- Emmanuelle Esperança-Rodier
	- Thomas François
	- Lorraine Goeuriot
	- Jérôme Goulian
	- Mathieu Lafourcade
	- Benjamin Lecouteux
	- François Portet
	- Fabien Ringeval
	- Vincent Vandeghinste
	- Maximin Coavoux
	- Marco Dinarelli
	- Didier Schwab



	## Citation

	If you use this model for your own research work, please cite as follows:

	```bibtex
	@inproceedings{segonne:hal-04535557,
	TITLE = {{Jargon: A Suite of Language Models and Evaluation Tasks for French Specialized Domains}},
	AUTHOR = {Segonne, Vincent and Mannion, Aidan and Alonzo Canul, Laura Cristina and Audibert, Alexandre and Liu, Xingyu and Macaire, C{\'e}cile and Pupier, Adrien and Zhou, Yongxin and Aguiar, Mathilde and Herron, Felix and Norr{\'e}, Magali and Amini, Massih-Reza and Bouillon, Pierrette and Eshkol-Taravella, Iris and Esperan{\c c}a-Rodier, Emmanuelle and Fran{\c c}ois, Thomas and Goeuriot, Lorraine and Goulian, J{\'e}r{\^o}me and Lafourcade, Mathieu and Lecouteux, Benjamin and Portet, Fran{\c c}ois and Ringeval, Fabien and Vandeghinste, Vincent and Coavoux, Maximin and Dinarelli, Marco and Schwab, Didier},
	URL = {https://hal.science/hal-04535557},
	BOOKTITLE = {{LREC-COLING 2024 - Joint International Conference on Computational Linguistics, Language Resources and Evaluation}},
	ADDRESS = {Turin, Italy},
	YEAR = {2024},
	MONTH = May,
	KEYWORDS = {Self-supervised learning ; Pretrained language models ; Evaluation benchmark ; Biomedical document processing ; Legal document processing ; Speech transcription},
	PDF = {https://hal.science/hal-04535557/file/FB2_domaines_specialises_LREC_COLING24.pdf},
	HAL_ID = {hal-04535557},
	HAL_VERSION = {v1},
	}
	```



	<!-- - Finetuned from model [optional]: [More Information Needed] -->
	<!--
	### Model Sources [optional]


	<!-- Provide the basic links for the model. -->