Rahkakavee Baskaran committed
Commit 66bfffb
1 Parent(s): 3faa039
add demo
README.md
CHANGED
@@ -65,7 +65,6 @@ license: cc-by-4.0
 - **Finetuned from model:** "bert-base-german-case. For more information on the model check on [this model card](https://huggingface.co/bert-base-german-cased)"
 - **license**: cc-by-4.0

-
 ## Model Sources

 - **Repository**:
@@ -120,11 +119,13 @@ output = pipeline(queries)

 The input data must be a list of dictionaries. Each dictionary must contain the keys 'id' and 'title'. The key title is the input for the pipeline. The output is again a list of dictionaries containing the id, the title and the key 'prediction' with the prediction of the algorithm.

+If you want to predict only a few titles or test the model, you can also take a look at our algorithm demo [here](https://huggingface.co/spaces/and-effect/Musterdatenkatalog).
+
 ## Classification Process

 The classification is realized using semantic search. For this purpose, both the taxonomy and the queries, in this case dataset titles, are embedded with the model. Using cosine similarity, the label with the highest similarity to the query is determined.

-![](assets/semantic_search.png)
+![Semantic Search](assets/semantic_search.png)

 ## Direct Use

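The input/output contract described in the hunk above is easy to illustrate. A minimal sketch, assuming the `pipeline` object built in the README's usage section (the `output = pipeline(queries)` call shown in the hunk header); the titles are hypothetical examples:

```python
# Minimal sketch of the expected input/output format.
# `pipeline` is assumed to be the object constructed earlier in the README
# (`output = pipeline(queries)`); the titles below are hypothetical.
queries = [
    {"id": "1", "title": "Baumkataster der Stadt Musterstadt"},
    {"id": "2", "title": "Bebauungsplaene 2022"},
]

output = pipeline(queries)

# Expected shape of `output`: the input dictionaries, each extended with a
# 'prediction' key holding the taxonomy label assigned by the model, e.g.
# [{"id": "1", "title": "...", "prediction": "<taxonomy label>"}, ...]
```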
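The classification process in the same hunk (embed taxonomy labels and dataset titles, pick the label with the highest cosine similarity) could look roughly as follows with sentence-transformers; the model id and the taxonomy labels are placeholders, not the exact setup:

```python
from sentence_transformers import SentenceTransformer, util

# Rough sketch of the semantic-search classification; model id and labels
# are placeholder assumptions, not the actual taxonomy.
model = SentenceTransformer("and-effect/musterdatenkatalog_clf")  # assumed model id

taxonomy = ["Raumordnung - Bebauungsplan", "Umwelt - Baumkataster"]  # placeholder labels
titles = ["Baumkataster der Stadt Musterstadt"]                      # placeholder queries

label_embeddings = model.encode(taxonomy, convert_to_tensor=True)
title_embeddings = model.encode(titles, convert_to_tensor=True)

# Cosine similarity between every title and every label; the best label per
# title is the argmax along the label dimension.
scores = util.cos_sim(title_embeddings, label_embeddings)
best = scores.argmax(dim=1)

for title, idx in zip(titles, best):
    print(title, "->", taxonomy[int(idx)])
```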
@@ -140,11 +141,10 @@ The model has some limititations. The model has some limitations in terms of the

 ## Training Details

-
+### Training Data

 You can find all information about the training data [here](https://huggingface.co/datasets/and-effect/mdk_gov_data_titles_clf). For the Fine Tuning we used the revision 172e61bb1dd20e43903f4c51e5cbec61ec9ae6e6 of the data, since the performance was better with this previous version of the data. We additionally applied [AugmentedSBERT](https://www.sbert.net/examples/training/data_augmentation/README.html) to extend the dataset for better performance.

-## Training Procedure

 ### Preprocessing

@@ -160,7 +160,8 @@ The model is fine tuned with similar and dissimilar pairs. Similar pairs are bui
 | test_unsimilar_pairs | 249 |

 We trained a CrossEncoder based on this data and used it again to generate new samplings based on the dataset titles (silver data). Using both we then fine tuned a bi-encoder, representing the resulting model.
-
+
+### Training Parameter

 The model was trained with the parameters:

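The CrossEncoder / silver-data step described in the hunk above can be sketched with sentence-transformers. The checkpoint and the titles below are placeholders; the actual pair construction follows the AugmentedSBERT recipe linked in the training-data section:

```python
from itertools import combinations
from sentence_transformers import CrossEncoder

# Sketch of the silver-data step: a CrossEncoder (placeholder checkpoint)
# scores unlabeled title pairs; the scores serve as soft labels ("silver"
# pairs) for the subsequent bi-encoder fine-tuning.
cross_encoder = CrossEncoder("cross-encoder/stsb-roberta-base")  # placeholder

titles = [
    "Baumkataster 2022",
    "Baumbestand im Stadtgebiet",
    "Haushaltsplan 2021",
]
candidate_pairs = [list(pair) for pair in combinations(titles, 2)]

scores = cross_encoder.predict(candidate_pairs)  # one similarity score per pair
silver_pairs = [(a, b, float(s)) for (a, b), s in zip(candidate_pairs, scores)]
```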
@@ -170,7 +171,7 @@ The model was trained with the parameters:
 **Loss**:
 `sentence_transformers.losses.CosineSimilarityLoss.CosineSimilarityLoss`

-
+Hyperparameters:

 ```json
 {
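For orientation, a fine-tuning skeleton that uses the loss named in the hunk above; the base checkpoint, the example pairs and the parameter values are placeholder assumptions, while the actual values are the hyperparameters listed in the README's JSON block:

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Skeleton of the bi-encoder fine-tuning with CosineSimilarityLoss; the base
# checkpoint, pairs and parameter values below are placeholders only.
model = SentenceTransformer("bert-base-german-cased")

train_examples = [
    InputExample(texts=["Baumkataster 2022", "Umwelt - Baumkataster"], label=1.0),
    InputExample(texts=["Baumkataster 2022", "Raumordnung - Bebauungsplan"], label=0.0),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.CosineSimilarityLoss(model)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,
    warmup_steps=100,
)
```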