DeDeckerThomas committed on
Commit
6bfff72
•
1 Parent(s): 8ca8633

Update README.md

  1. README.md +45 -22
README.md CHANGED
@@ -27,13 +27,16 @@ model-index:
27
  value: 0.588
28
  name: F1-score
29
  ---
30
- # 🔑 Keyphrase Extraction model: KBIR-inspec
31
- Keyphrase extraction is a technique in text analysis where you extract the important keyphrases from a text. Since this is a time-consuming process, Artificial Intelligence is used to automate it. Currently, classical machine learning methods, that use statistics and linguistics, are widely used for the extraction process. The fact that these methods have been widely used in the community has the advantage that there are many easy-to-use libraries. Now with the recent innovations in NLP, transformers can be used to improve keyphrase extraction. Transformers also focus on the semantics and context of a document, which is quite an improvement.
 
 
 
32
 
33
 
34
  ## 📓 Model Description
35
- This model is a fine-tuned KBIR model on the Inspec dataset. KBIR or Keyphrase Boundary Infilling with Replacement is a pre-trained model which utilizes a multi-task learning setup for optimizing a combined loss of Masked Language Modeling (MLM), Keyphrase Boundary Infilling (KBI) and Keyphrase Replacement Classification (KRC).
36
- You can find more information about the architecture in this paper: https://arxiv.org/abs/2112.08547.
37
 
38
  The model is fine-tuned as a token classification problem where the text is labeled using the BIO scheme.
39
 
@@ -47,13 +50,13 @@ Kulkarni, Mayank, Debanjan Mahata, Ravneet Arora, and Rajarshi Bhowmik. "Learnin
47
 
48
  Sahrawat, Dhruva, Debanjan Mahata, Haimin Zhang, Mayank Kulkarni, Agniv Sharma, Rakesh Gosangi, Amanda Stent, Yaman Kumar, Rajiv Ratn Shah, and Roger Zimmermann. "Keyphrase extraction as sequence labeling using contextualized embeddings." In European Conference on Information Retrieval, pp. 328-335. Springer, Cham, 2020.
49
 
50
- ## ✋ Intended uses & limitations
51
  ### 🛑 Limitations
52
  * This keyphrase extraction model is very domain-specific and will perform very well on abstracts of scientific papers. It's not recommended to use this model for other domains, but you are free to test it out.
53
  * Only works for English documents.
54
- * For a custom model, please consult the training notebook for more information (link incoming).
55
 
56
- ### ❓ How to use
57
  ```python
58
  from transformers import (
59
  TokenClassificationPipeline,
@@ -86,6 +89,7 @@ class KeyphraseExtractionPipeline(TokenClassificationPipeline):
86
  # Load pipeline
87
  model_name = "ml6team/keyphrase-extraction-kbir-inspec"
88
  extractor = KeyphraseExtractionPipeline(model=model_name)
 
89
  ```
90
  ```python
91
  # Inference
@@ -97,32 +101,30 @@ are widely used for the extraction process. The fact that these methods have bee
97
  has the advantage that there are many easy-to-use libraries. Now with the recent innovations in NLP,
98
  transformers can be used to improve keyphrase extraction. Transformers also focus on the semantics
99
  and context of a document, which is quite an improvement.
100
- """.replace(
101
- "\n", ""
102
- )
103
 
104
  keyphrases = extractor(text)
105
 
106
  print(keyphrases)
 
107
  ```
108
 
109
  ```
110
  # Output
111
- ['Artificial Intelligence', 'Keyphrase extraction', 'NLP',
112
- 'classical machine learning', 'keyphrase extraction',
113
- 'linguistics', 'semantics', 'statistics', 'text analysis',
114
- 'transformers']
115
  ```
116
 
117
  ## 📚 Training Dataset
118
- Inspec is a keyphrase extraction/generation dataset consisting of 2000 English scientific papers from the scientific domains of Computers and Control and Information Technology published between 1998 to 2002. The keyphrases are annotated by professional indexers or editors.
119
 
120
- You can find more information here: https://huggingface.co/datasets/midas/inspec
121
 
122
- ## 👷‍♂️ Training procedure
123
- For more in detail information, you can take a look at the training notebook (link incoming).
124
 
125
- ### Training parameters
126
 
127
  | Parameter | Value |
128
  | --------- | ------|
@@ -132,12 +134,26 @@ For more in detail information, you can take a look at the training notebook (li
132
 
133
  ### Preprocessing
134
  The documents in the dataset are already preprocessed into lists of words with the corresponding labels. The only thing that must be done is tokenization and the realignment of the labels so that they correspond with the right subword tokens.
 
135
  ```python
 
 
 
136
  # Labels
137
  label_list = ["B", "I", "O"]
138
  lbl2idx = {"B": 0, "I": 1, "O": 2}
139
  idx2label = {0: "B", 1: "I", 2: "O"}
140
 
 
 
 
 
 
 
 
 
 
 
141
  def preprocess_fuction(all_samples_per_split):
142
  tokenized_samples = tokenizer.batch_encode_plus(
143
  all_samples_per_split[dataset_document_column],
@@ -171,10 +187,17 @@ def preprocess_fuction(all_samples_per_split):
171
  total_adjusted_labels.append(adjusted_label_ids)
172
  tokenized_samples["labels"] = total_adjusted_labels
173
  return tokenized_samples
 
 
 
 
 
 
 
174
  ```
175
 
176
  ### Postprocessing
177
- For the post-processing, you will need to filter out the B and I labeled tokens and concat the consecutive Bs and Is. As last you strip the keyphrase to ensure all spaces are removed.
178
  ```python
179
  # Define post_process functions
180
  def concat_tokens_by_tag(keyphrases):
@@ -208,14 +231,14 @@ def extract_keyphrases(example, predictions, tokenizer, index=0):
208
  ```
209
  ## 📝 Evaluation results
210
 
211
- One of the traditional evaluation methods is the precision, recall and F1-score @k,m where k is the number that stands for the first k predicted keyphrases and m for the average amount of predicted keyphrases.
212
  The model achieves the following results on the Inspec test set:
213
 
214
  | Dataset | P@5 | R@5 | F1@5 | P@10 | R@10 | F1@10 | P@M | R@M | F1@M |
215
  |:-----------------:|:----:|:----:|:----:|:----:|:----:|:-----:|:----:|:----:|:----:|
216
  | Inspec Test Set | 0.53 | 0.47 | 0.46 | 0.36 | 0.58 | 0.41 | 0.58 | 0.60 | 0.56 |
217
 
218
- For more information on the evaluation process, you can take a look at the keyphrase extraction evaluation notebook.
219
 
220
  ## 🚨 Issues
221
  Please feel free to start discussions in the Community Tab.
 
27
  value: 0.588
28
  name: F1-score
29
  ---
30
+ # 🔑 Keyphrase Extraction Model: KBIR-inspec
31
+ Keyphrase extraction is a technique in text analysis where you extract the important keyphrases from a document. Thanks to these keyphrases, humans can understand the content of a text quickly and easily without reading it completely. Keyphrase extraction was first done primarily by human annotators, who read the text in detail and then wrote down the most important keyphrases. The disadvantage is that if you work with a lot of documents, this process can take a lot of time ⏳.
32
+
33
+ Here is where Artificial Intelligence 🤖 comes in. Currently, classical machine learning methods, which use statistical and linguistic features, are widely used for the extraction process. Now with deep learning, it is possible to capture the semantic meaning of a text even better than these classical methods. Classical methods look at the frequency, occurrence and order of words in the text, whereas these neural approaches can capture long-term semantic dependencies and the context of words in a text.
34
+
35
 
36
 
37
  ## 📓 Model Description
38
+ This model uses [KBIR](https://huggingface.co/bloomberg/KBIR) as its base model and fine-tunes it on the [Inspec dataset](https://huggingface.co/datasets/midas/inspec). KBIR (Keyphrase Boundary Infilling with Replacement) is a pre-trained model that uses a multi-task learning setup to optimize a combined loss of Masked Language Modeling (MLM), Keyphrase Boundary Infilling (KBI) and Keyphrase Replacement Classification (KRC).
39
+ You can find more information about the architecture in this [paper](https://arxiv.org/abs/2112.08547).
40
 
41
  The model is fine-tuned as a token classification problem where the text is labeled using the BIO scheme.
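
As a rough illustration of the BIO scheme (not the labeling code used for this model), the sketch below tags each word as `B` (begins a keyphrase), `I` (inside a keyphrase) or `O` (outside). The function name `bio_tags` and the exact-match lookup are assumptions made for this example only.

```python
def bio_tags(words, keyphrases):
    """Tag each word B (begins a keyphrase), I (inside one) or O (outside)."""
    tags = ["O"] * len(words)
    lowered = [w.lower() for w in words]
    for phrase in keyphrases:
        target = phrase.lower().split()
        n = len(target)
        for start in range(len(words) - n + 1):
            span_is_free = all(t == "O" for t in tags[start:start + n])
            if lowered[start:start + n] == target and span_is_free:
                tags[start] = "B"
                # Every following word of the same keyphrase is tagged I.
                tags[start + 1:start + n] = ["I"] * (n - 1)
    return tags


words = "Keyphrase extraction is a technique in text analysis".split()
tags = bio_tags(words, ["keyphrase extraction", "text analysis"])
print(list(zip(words, tags)))
```

A token classifier then predicts one of these three tags per token instead of generating keyphrases directly.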
42
 
 
50
 
51
  Sahrawat, Dhruva, Debanjan Mahata, Haimin Zhang, Mayank Kulkarni, Agniv Sharma, Rakesh Gosangi, Amanda Stent, Yaman Kumar, Rajiv Ratn Shah, and Roger Zimmermann. "Keyphrase extraction as sequence labeling using contextualized embeddings." In European Conference on Information Retrieval, pp. 328-335. Springer, Cham, 2020.
52
 
53
+ ## ✋ Intended Uses & Limitations
54
  ### 🛑 Limitations
55
  * This keyphrase extraction model is very domain-specific and will perform very well on abstracts of scientific papers. It's not recommended to use this model for other domains, but you are free to test it out.
56
  * Only works for English documents.
57
+ * For a custom model, please consult the [training notebook]() for more information.
58
 
59
+ ### ❓ How To Use
60
  ```python
61
  from transformers import (
62
  TokenClassificationPipeline,
 
89
  # Load pipeline
90
  model_name = "ml6team/keyphrase-extraction-kbir-inspec"
91
  extractor = KeyphraseExtractionPipeline(model=model_name)
92
+
93
  ```
94
  ```python
95
  # Inference
 
101
  has the advantage that there are many easy-to-use libraries. Now with the recent innovations in NLP,
102
  transformers can be used to improve keyphrase extraction. Transformers also focus on the semantics
103
  and context of a document, which is quite an improvement.
104
+ """
 
 
105
 
106
  keyphrases = extractor(text)
107
 
108
  print(keyphrases)
109
+
110
  ```
111
 
112
  ```
113
  # Output
114
+ ['Artificial Intelligence' 'Keyphrase extraction' 'NLP'
115
+ 'keyphrase extraction' 'linguistics' 'machine learning' 'semantics'
116
+ 'statistics' 'text analysis']
 
117
  ```
118
 
119
  ## 📚 Training Dataset
120
+ [Inspec](https://huggingface.co/datasets/midas/inspec) is a keyphrase extraction/generation dataset consisting of 2000 English scientific papers from the scientific domains of Computers and Control and Information Technology published between 1998 and 2002. The keyphrases are annotated by professional indexers or editors.
121
 
122
+ You can find more information in the [paper](https://dl.acm.org/doi/10.3115/1119355.1119383).
123
 
124
+ ## 👷‍♂️ Training Procedure
125
+ For more detailed information, you can take a look at the [training notebook]().
126
 
127
+ ### Training Parameters
128
 
129
  | Parameter | Value |
130
  | --------- | ------|
 
134
 
135
  ### Preprocessing
136
  The documents in the dataset are already preprocessed into lists of words with the corresponding labels. The only thing that must be done is tokenization and the realignment of the labels so that they correspond with the right subword tokens.
137
+
138
  ```python
139
+ from datasets import load_dataset
140
+ from transformers import AutoTokenizer
141
+
142
  # Labels
143
  label_list = ["B", "I", "O"]
144
  lbl2idx = {"B": 0, "I": 1, "O": 2}
145
  idx2label = {0: "B", 1: "I", 2: "O"}
146
 
147
+ # Tokenizer
148
+ tokenizer = AutoTokenizer.from_pretrained("bloomberg/KBIR", add_prefix_space=True)
149
+ max_length = 512
150
+
151
+ # Dataset parameters
152
+ dataset_full_name = "midas/inspec"
153
+ dataset_subset = "raw"
154
+ dataset_document_column = "document"
155
+ dataset_biotags_column = "doc_bio_tags"
156
+
157
  def preprocess_fuction(all_samples_per_split):
158
  tokenized_samples = tokenizer.batch_encode_plus(
159
  all_samples_per_split[dataset_document_column],
 
187
  total_adjusted_labels.append(adjusted_label_ids)
188
  tokenized_samples["labels"] = total_adjusted_labels
189
  return tokenized_samples
190
+
191
+ # Load dataset
192
+ dataset = load_dataset(dataset_full_name, dataset_subset)
193
+
194
+ # Preprocess dataset
195
+ tokenized_dataset = dataset.map(preprocess_fuction, batched=True)
196
+
197
  ```
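
The label realignment described above can be sketched in isolation. This is a minimal example under stated assumptions, not the README's own `preprocess_fuction`: `word_ids` stands in for the per-token word indices a fast tokenizer reports (`None` for special tokens), special tokens get `-100` (the index commonly ignored by the loss), and continuation subwords of a `B` word are labeled `I`. The helper name `align_labels` is hypothetical.

```python
IGNORE_INDEX = -100  # assumption: label id skipped by the loss function
lbl2idx = {"B": 0, "I": 1, "O": 2}


def align_labels(word_ids, word_labels):
    """Expand word-level BIO labels to one label id per subword token."""
    aligned = []
    previous = None
    for wid in word_ids:
        if wid is None:
            # Special tokens ([CLS], [SEP], padding) carry no label.
            aligned.append(IGNORE_INDEX)
        elif wid != previous:
            # First subword of a word keeps the word's own label.
            aligned.append(lbl2idx[word_labels[wid]])
        else:
            # Later subwords of a B word continue the phrase as I.
            label = word_labels[wid]
            aligned.append(lbl2idx["I" if label == "B" else label])
        previous = wid
    return aligned


# Word 0 is split into two subwords; words 1 and 2 are single tokens.
word_ids = [None, 0, 0, 1, 2, None]
word_labels = ["B", "I", "O"]
aligned = align_labels(word_ids, word_labels)
print(aligned)
```

Other conventions exist, such as masking every continuation subword with `-100`; which one the training notebook uses is not visible in this diff.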
198
 
199
  ### Postprocessing
200
+ For post-processing, keep only the tokens labeled B or I and concatenate each B with the consecutive I tokens that follow it. Finally, strip each keyphrase to remove leading and trailing whitespace.
201
  ```python
202
  # Define post_process functions
203
  def concat_tokens_by_tag(keyphrases):
 
231
  ```
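
The post-processing idea can also be shown at word level. The sketch below is an illustrative stand-in for the elided `concat_tokens_by_tag` and `extract_keyphrases` functions, not a copy of them; `extract_phrases` is a hypothetical name, and dropping an `I` tag that has no preceding `B` is an assumption.

```python
def extract_phrases(words, tags):
    """Join each B word with the consecutive I words that follow it."""
    phrases, current = [], []
    for word, tag in zip(words, tags):
        if tag == "B":
            # A new keyphrase starts; flush the previous one if any.
            if current:
                phrases.append(" ".join(current))
            current = [word]
        elif tag == "I" and current:
            current.append(word)
        else:
            # O tag (or a stray I without a B) ends the current keyphrase.
            if current:
                phrases.append(" ".join(current))
            current = []
    if current:
        phrases.append(" ".join(current))
    return [phrase.strip() for phrase in phrases]


example_words = "Keyphrase extraction is a technique in text analysis".split()
example_tags = ["B", "I", "O", "O", "O", "O", "B", "I"]
phrases = extract_phrases(example_words, example_tags)
print(phrases)
```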
232
  ## 📝 Evaluation results
233
 
234
+ Traditional evaluation methods are precision, recall and F1-score @k,m, where k stands for the first k predicted keyphrases and m for the average number of predicted keyphrases.
235
  The model achieves the following results on the Inspec test set:
236
 
237
  | Dataset | P@5 | R@5 | F1@5 | P@10 | R@10 | F1@10 | P@M | R@M | F1@M |
238
  |:-----------------:|:----:|:----:|:----:|:----:|:----:|:-----:|:----:|:----:|:----:|
239
  | Inspec Test Set | 0.53 | 0.47 | 0.46 | 0.36 | 0.58 | 0.41 | 0.58 | 0.60 | 0.56 |
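
For readers unfamiliar with the @k metrics in the table, here is a minimal sketch of how precision, recall and F1 at a cut-off k can be computed for one document. It is not the evaluation notebook's code: it uses case-insensitive exact matching without stemming, and treating `k=None` as "use all predictions" is an assumption about the @M columns.

```python
def precision_recall_f1_at_k(predicted, gold, k=None):
    """Score the first k predicted keyphrases against the gold keyphrases."""
    gold_set = {g.lower() for g in gold}
    top = [p.lower() for p in (predicted if k is None else predicted[:k])]
    # Count distinct predictions that exactly match a gold keyphrase.
    hits = sum(1 for p in set(top) if p in gold_set)
    precision = hits / len(top) if top else 0.0
    recall = hits / len(gold_set) if gold_set else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


predicted = ["text analysis", "nlp", "statistics", "transformers"]
gold = ["text analysis", "keyphrase extraction", "transformers"]
p_at_2, r_at_2, f1_at_2 = precision_recall_f1_at_k(predicted, gold, k=2)
print(p_at_2, r_at_2, f1_at_2)
```

With k=2, one of the two predictions matches and one of the three gold keyphrases is found. Scores reported for a dataset are typically averaged over all documents.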
240
 
241
+ For more information on the evaluation process, you can take a look at the keyphrase extraction [evaluation notebook]().
242
 
243
  ## 🚨 Issues
244
  Please feel free to start discussions in the Community Tab.