---
license: mit
language:
- fr
library_name: transformers
tags:
- linformer
- legal
- RoBERTa
- pytorch
---
# Jargon-general-legal
[Jargon](https://hal.science/hal-04535557/file/FB2_domaines_specialises_LREC_COLING24.pdf) is an efficient transformer encoder language model for French, combining the Linformer attention mechanism with the RoBERTa model architecture.
Jargon is available in several versions with different context sizes and types of pre-training corpora.
| **Model** | **Initialised from...** |**Training Data**|
|-------------------------------------------------------------------------------------|:-----------------------:|:----------------:|
| [jargon-general-base](https://huggingface.co/PantagrueLLM/jargon-general-base) | scratch |8.5GB Web Corpus|
| [jargon-general-biomed](https://huggingface.co/PantagrueLLM/jargon-general-biomed) | jargon-general-base |5.4GB Medical Corpus|
| [jargon-general-legal](https://huggingface.co/PantagrueLLM/jargon-general-legal) (this model) | jargon-general-base |18GB Legal Corpus|
| [jargon-multidomain-base](https://huggingface.co/PantagrueLLM/jargon-multidomain-base) | jargon-general-base |Medical+Legal Corpora|
| [jargon-legal](https://huggingface.co/PantagrueLLM/jargon-legal) | scratch |18GB Legal Corpus|
| [jargon-legal-4096](https://huggingface.co/PantagrueLLM/jargon-legal-4096) | scratch |18GB Legal Corpus|
| [jargon-biomed](https://huggingface.co/PantagrueLLM/jargon-biomed) | scratch |5.4GB Medical Corpus|
| [jargon-biomed-4096](https://huggingface.co/PantagrueLLM/jargon-biomed-4096) | scratch |5.4GB Medical Corpus|
| [jargon-NACHOS](https://huggingface.co/PantagrueLLM/jargon-NACHOS) | scratch |[NACHOS](https://drbert.univ-avignon.fr/)|
| [jargon-NACHOS-4096](https://huggingface.co/PantagrueLLM/jargon-NACHOS-4096) | scratch |[NACHOS](https://drbert.univ-avignon.fr/)|
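The context size of a given checkpoint can be read from its configuration; below is a minimal sketch, assuming the custom config exposes the standard `max_position_embeddings` field used by RoBERTa-style configs:

```python
from transformers import AutoConfig

# trust_remote_code=True is needed because the Linformer-based model uses custom code
config = AutoConfig.from_pretrained("PantagrueLLM/jargon-general-legal", trust_remote_code=True)
print(config.max_position_embeddings)  # maximum input length (including special tokens)
```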
## Evaluation
The Jargon models were evaluated on a range of specialized downstream tasks.
#### Legal Domain Benchmark
Results averaged across five runs with varying random seeds.
| |[ECtHR-FR](https://huggingface.co/datasets/audibeal/fr-echr)|[OACS](https://www.jeuxdemots.org/OACS/oacs.php)|[SJP](https://aclanthology.org/2021.nllp-1.3/)|
|-------------------------|:-----------------------:|:-----------------------:|:-----------------------:|
| **Task Type** | Document Classification | Document Classification | Document Classification |
| **Metric** | Macro-F1 | Macro-F1 | Macro-F1 |
| jargon-general-base | 42.9 | 50.8 | 55.1 |
| jargon-multidomain-base | 44.5 | 55.6 | 58.1 |
| jargon-general-legal | 43.1 | 49.9 | 54.5 |
| jargon-legal | 44.6 | 51.6 | 56.7 |
| jargon-legal-4096 | 45.9 | 54.1 | 68.2 |
For more info please check out the [paper](https://hal.science/hal-04535557/file/FB2_domaines_specialises_LREC_COLING24.pdf), accepted for publication at [LREC-COLING 2024](https://lrec-coling-2024.org/list-of-accepted-papers/).
## Using Jargon models with HuggingFace transformers
You can get started with this model using the code snippet below:
```python
from transformers import AutoModelForMaskedLM, AutoTokenizer, pipeline

# trust_remote_code=True is required because the Linformer attention layers are implemented in custom code
tokenizer = AutoTokenizer.from_pretrained("PantagrueLLM/jargon-general-legal", trust_remote_code=True)
model = AutoModelForMaskedLM.from_pretrained("PantagrueLLM/jargon-general-legal", trust_remote_code=True)

# Fill-mask pipeline: predict the most likely tokens for the <mask> position
jargon_maskfiller = pipeline("fill-mask", model=model, tokenizer=tokenizer)
output = jargon_maskfiller("Il est allé au <mask> hier")
```
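The pipeline returns a ranked list of candidate completions; each entry is a dictionary containing (among other fields) the predicted token and its score, so you can inspect the top predictions like this:

```python
# Print each candidate token with its probability score
for candidate in output:
    print(f"{candidate['token_str']}\t{candidate['score']:.3f}")
```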
You can also use the classes `AutoModel`, `AutoModelForSequenceClassification`, or `AutoModelForTokenClassification` to load Jargon models, depending on the downstream task in question.
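For example, a fine-tuning setup for a document-classification task like the benchmarks above might start along the following lines. This is a minimal sketch, not part of this repository: the label count and example sentence are placeholders, and it assumes the custom remote code provides the corresponding classification head, as noted above.

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("PantagrueLLM/jargon-general-legal", trust_remote_code=True)
model = AutoModelForSequenceClassification.from_pretrained(
    "PantagrueLLM/jargon-general-legal",
    trust_remote_code=True,
    num_labels=2,  # placeholder: set to the number of classes in your task
)

inputs = tokenizer("Le tribunal a rejeté la demande.", return_tensors="pt")
logits = model(**inputs).logits  # shape (1, num_labels); the head is untrained until you fine-tune
```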
- **Language(s):** French
- **License:** MIT
- **Developed by:** Vincent Segonne
- **Funded by**
- GENCI-IDRIS (Grant 2022 A0131013801)
- French National Research Agency: Pantagruel grant ANR-23-IAS1-0001
- MIAI@Grenoble Alpes ANR-19-P3IA-0003
- PROPICTO ANR-20-CE93-0005
- Lawbot ANR-20-CE38-0013
- Swiss National Science Foundation (grant PROPICTO N°197864)
- **Authors**
- Vincent Segonne
- Aidan Mannion
- Laura Cristina Alonzo Canul
- Alexandre Audibert
- Xingyu Liu
- Cécile Macaire
- Adrien Pupier
- Yongxin Zhou
- Mathilde Aguiar
- Felix Herron
- Magali Norré
- Massih-Reza Amini
- Pierrette Bouillon
- Iris Eshkol-Taravella
- Emmanuelle Esperança-Rodier
- Thomas François
- Lorraine Goeuriot
- Jérôme Goulian
- Mathieu Lafourcade
- Benjamin Lecouteux
- François Portet
- Fabien Ringeval
- Vincent Vandeghinste
- Maximin Coavoux
- Marco Dinarelli
- Didier Schwab
## Citation
If you use this model for your own research work, please cite as follows:
```bibtex
@inproceedings{segonne:hal-04535557,
TITLE = {{Jargon: A Suite of Language Models and Evaluation Tasks for French Specialized Domains}},
AUTHOR = {Segonne, Vincent and Mannion, Aidan and Alonzo Canul, Laura Cristina and Audibert, Alexandre and Liu, Xingyu and Macaire, C{\'e}cile and Pupier, Adrien and Zhou, Yongxin and Aguiar, Mathilde and Herron, Felix and Norr{\'e}, Magali and Amini, Massih-Reza and Bouillon, Pierrette and Eshkol-Taravella, Iris and Esperan{\c c}a-Rodier, Emmanuelle and Fran{\c c}ois, Thomas and Goeuriot, Lorraine and Goulian, J{\'e}r{\^o}me and Lafourcade, Mathieu and Lecouteux, Benjamin and Portet, Fran{\c c}ois and Ringeval, Fabien and Vandeghinste, Vincent and Coavoux, Maximin and Dinarelli, Marco and Schwab, Didier},
URL = {https://hal.science/hal-04535557},
BOOKTITLE = {{LREC-COLING 2024 - Joint International Conference on Computational Linguistics, Language Resources and Evaluation}},
ADDRESS = {Turin, Italy},
YEAR = {2024},
MONTH = May,
KEYWORDS = {Self-supervised learning ; Pretrained language models ; Evaluation benchmark ; Biomedical document processing ; Legal document processing ; Speech transcription},
PDF = {https://hal.science/hal-04535557/file/FB2_domaines_specialises_LREC_COLING24.pdf},
HAL_ID = {hal-04535557},
HAL_VERSION = {v1},
}
```