louisbrulenaudet committed
Commit 62fa052
1 parent: a128769

Update README.md

Files changed (1): README.md (+39 −22)

README.md

---
tags:
- feature-extraction
- sentence-similarity
- transformers
- legal
- french-law
- droit français
- tax
- droit fiscal
- fiscalité
license: apache-2.0
pretty_name: Domain-adapted mBERT for French Tax Practice
datasets:
- louisbrulenaudet/lpf
- louisbrulenaudet/cgi
- louisbrulenaudet/code-douanes
language:
- fr
library_name: sentence-transformers
---

# Domain-adapted mBERT for French Tax Practice

This is a [sentence-transformers](https://www.SBERT.net) model: it maps sentences and paragraphs to a 768-dimensional dense vector space and can be used for tasks like clustering or semantic search.

It is a pretrained transformer model covering the top 102 languages with the largest Wikipedias, trained with a masked language modeling (MLM) objective and then fitted with a Transformer-based Sequential Denoising Auto-Encoder (TSDAE) for unsupervised sentence embedding learning, with a single objective: French tax domain adaptation.

In this way, the model learns an inner representation of French legal language from the training set that can be used to extract features for downstream tasks: if you have a dataset of labeled sentences, for instance, you can train a standard classifier using the features produced by the model as inputs, as sketched below.
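
As a minimal sketch of that classifier workflow (the labeled sentences, labels, and the choice of scikit-learn classifier below are illustrative assumptions, not part of this model):

```python
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression

model = SentenceTransformer("louisbrulenaudet/tsdae-lemone-mbert-tax")

# Hypothetical labeled sentences: 1 = tax-related, 0 = not tax-related.
texts = [
    "Le taux normal de la taxe sur la valeur ajoutée est fixé à 20 %.",
    "Les plus-values immobilières sont soumises à l'impôt sur le revenu.",
    "Le contrat de bail est conclu pour une durée de trois ans.",
    "Le salarié bénéficie d'un préavis d'un mois.",
]
labels = [1, 1, 0, 0]

# Use the sentence embeddings as features for a standard classifier.
features = model.encode(texts)
classifier = LogisticRegression(max_iter=1000).fit(features, labels)

print(classifier.predict(model.encode(["La TVA s'applique aux livraisons de biens."])))
```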

## Usage (Sentence-Transformers)

Using this model becomes easy when you have [sentence-transformers](https://www.SBERT.net) installed (`pip install -U sentence-transformers`). Then you can use the model like this:

```python
from sentence_transformers import SentenceTransformer

sentences = ["This is an example sentence", "Each sentence is converted"]

model = SentenceTransformer("louisbrulenaudet/tsdae-lemone-mbert-tax")
embeddings = model.encode(sentences)
print(embeddings)
```
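
For retrieval-style use, here is a short semantic-search sketch; the corpus snippets and the query are illustrative and not drawn from the training data:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("louisbrulenaudet/tsdae-lemone-mbert-tax")

corpus = [
    "Le taux normal de la TVA est fixé à 20 %.",
    "Les droits de douane sont perçus lors de l'importation des marchandises.",
    "L'impôt sur le revenu est calculé par foyer fiscal.",
]
corpus_embeddings = model.encode(corpus, convert_to_tensor=True)

query_embedding = model.encode(
    "Quel est le taux de la taxe sur la valeur ajoutée ?", convert_to_tensor=True
)

# Retrieve the two corpus entries most similar to the query (cosine similarity).
hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=2)
print(hits)
```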

## Usage (HuggingFace Transformers)

Without [sentence-transformers](https://www.SBERT.net), you can use the model like this: first, pass your input through the transformer model, then apply the right pooling operation on top of the contextualized word embeddings.

```python
from transformers import AutoTokenizer, AutoModel
import torch


def cls_pooling(model_output, attention_mask):
    # Use the embedding of the first ([CLS]) token as the sentence embedding.
    return model_output[0][:, 0]


# Sentences we want sentence embeddings for
sentences = ["This is an example sentence", "Each sentence is converted"]

# Load model from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained("louisbrulenaudet/tsdae-lemone-mbert-tax")
model = AutoModel.from_pretrained("louisbrulenaudet/tsdae-lemone-mbert-tax")

# Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")

# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)

# Perform pooling. In this case, CLS pooling.
sentence_embeddings = cls_pooling(model_output, encoded_input["attention_mask"])

print("Sentence embeddings:")
print(sentence_embeddings)
```
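
Continuing from the snippet above, the CLS embeddings can be compared with cosine similarity; L2-normalizing first is a common convention rather than something this card prescribes:

```python
import torch.nn.functional as F

# L2-normalize so that the dot product between rows equals cosine similarity.
normalized_embeddings = F.normalize(sentence_embeddings, p=2, dim=1)
similarity_matrix = normalized_embeddings @ normalized_embeddings.T

print(similarity_matrix)
```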

## Training

The model was trained with the parameters:

Parameters of the fit()-Method:

```
{
    "epochs": 1,
    "evaluation_steps": 0,
    "max_grad_norm": 1,
    "optimizer_class": "<class 'torch.optim.adamw.AdamW'>",
    "optimizer_params": {
}
```
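
The card describes TSDAE (Transformer-based Sequential Denoising Auto-Encoder) fitting but does not reproduce the full training script here. The following is only a rough sketch of how such a fit is typically set up with sentence-transformers; the base checkpoint, corpus, batch size, and learning rate are assumptions, not the values actually used:

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, models, datasets, losses

# Assumed base checkpoint (multilingual BERT); the exact checkpoint is not restated here.
word_embedding_model = models.Transformer("bert-base-multilingual-uncased")
pooling_model = models.Pooling(
    word_embedding_model.get_word_embedding_dimension(), pooling_mode="cls"
)
model = SentenceTransformer(modules=[word_embedding_model, pooling_model])

# Hypothetical unlabeled French tax sentences (e.g. drawn from the datasets listed above).
train_sentences = [
    "Le taux normal de la TVA est fixé à 20 %.",
    "Les droits de douane sont perçus lors de l'importation des marchandises.",
]

# TSDAE: the dataset corrupts each sentence (token deletion) and the loss
# trains the encoder to reconstruct the original sentence from its embedding.
train_dataset = datasets.DenoisingAutoEncoderDataset(train_sentences)
train_dataloader = DataLoader(train_dataset, batch_size=8, shuffle=True)
train_loss = losses.DenoisingAutoEncoderLoss(model, tie_encoder_decoder=True)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,
    max_grad_norm=1,
    optimizer_params={"lr": 3e-5},  # assumed learning rate
    show_progress_bar=True,
)
```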

## Full Model Architecture

```
SentenceTransformer(
```

  ## Citing & Authors

If you use this code in your research, please use the following BibTeX entry.

```bibtex
@misc{louisbrulenaudet2023,
  author = {Louis Brulé Naudet},
  title = {Domain-adapted mBERT for French Tax Practice},
  year = {2023},
  howpublished = {\url{https://huggingface.co/louisbrulenaudet/tsdae-lemone-mbert-tax}},
}
```

## Feedback

If you have any feedback, please reach out at [[email protected]](mailto:[email protected]).