---
language:
- en
tags:
- sentence-similarity
- text-classification
datasets:
- dennlinger/wiki-paragraphs
metrics:
- f1
license: mit
---
# BERT-Wiki-Paragraphs
Authors: Satya Almasian\*, Dennis Aumiller\*, Lucienne-Sophie Marmé, Michael Gertz
Contact us at `<lastname>@informatik.uni-heidelberg.de`
Details for the training method can be found in our work [Structural Text Segmentation of Legal Documents](https://arxiv.org/abs/2012.03619).
The training procedure follows the same setup, but we replace the legal documents with Wikipedia articles in this model.
Find the associated training data here: [wiki-paragraphs](https://huggingface.co/datasets/dennlinger/wiki-paragraphs)
Training is weakly supervised: the model learns to determine whether paragraphs topically belong together or not.
We utilize automatically generated samples from Wikipedia for training, where paragraphs from within the same section are assumed to be topically coherent.
We use the same articles as ([Koshorek et al., 2018](https://arxiv.org/abs/1803.09337)),
albeit from a 2021 dump of Wikipedia, and split at paragraph boundaries instead of the sentence level.
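
The weak labeling described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the authors' pipeline: the `make_pairs` helper and the article layout (a list of sections, each a list of paragraph strings) are hypothetical. Pairing only consecutive paragraphs, and drawing negatives from the boundary between adjacent sections, are also assumptions, since this card does not describe how samples are drawn.

```python
from typing import List, Tuple

Section = List[str]  # hypothetical layout: one section = list of paragraph strings


def make_pairs(article: List[Section]) -> List[Tuple[str, str, int]]:
    """Build (first, second, label) samples from one article.

    Label 1: consecutive paragraphs within the same section.
    Label 0: last paragraph of a section paired with the first paragraph
             of the next section (an assumed negative-sampling scheme).
    """
    pairs = []
    for idx, section in enumerate(article):
        # Positive samples: neighbours inside one section.
        for first, second in zip(section, section[1:]):
            pairs.append((first, second, 1))
        # Negative sample: a pair that crosses into the following section.
        if idx + 1 < len(article) and section and article[idx + 1]:
            pairs.append((section[-1], article[idx + 1][0], 0))
    return pairs


article = [["a1", "a2", "a3"], ["b1", "b2"]]
samples = make_pairs(article)
```

Here `samples` holds three positive pairs and one negative pair, the latter spanning the boundary between the two toy sections.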
## Usage
Preferred usage is through `transformers.pipeline`:
```python
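from transformers import pipeline

# Load the paragraph-pair classifier from the Hugging Face Hub.
pipe = pipeline("text-classification", model="dennlinger/bert-wiki-paragraphs")
# Join the two paragraphs with a [SEP] token to form a single input string.
pipe("{First paragraph} [SEP] {Second paragraph}")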
```
A predicted "1" means that the paragraphs belong to the same topic; a "0" indicates a disconnect.
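
One way to apply these predictions is to segment a document by comparing each paragraph with its successor. The sketch below is a hedged illustration: `format_pair` and `segment` are hypothetical helpers, and the `same_topic` callable stands in for the model (for instance, a wrapper around the pipeline that returns `True` for a predicted "1"). The exact label strings returned by the pipeline are not stated in this card, so that mapping is deliberately left outside the helper.

```python
from typing import Callable, List


def format_pair(first: str, second: str) -> str:
    """Join two paragraphs into the single-string input format shown above."""
    return f"{first} [SEP] {second}"


def segment(paragraphs: List[str], same_topic: Callable[[str], bool]) -> List[List[str]]:
    """Group consecutive paragraphs; start a new group on a predicted disconnect."""
    segments: List[List[str]] = []
    for paragraph in paragraphs:
        # Compare the current paragraph with the last one of the open segment.
        if segments and same_topic(format_pair(segments[-1][-1], paragraph)):
            segments[-1].append(paragraph)
        else:
            segments.append([paragraph])
    return segments


# Toy stand-in for the classifier (assumption): two paragraphs share a
# topic if they start with the same letter.
def toy_same_topic(text: str) -> bool:
    first, second = text.split(" [SEP] ")
    return first[0] == second[0]


result = segment(["cats purr", "cats nap", "dogs bark"], toy_same_topic)
```

With the toy predicate, the first two paragraphs end up in one segment and the third starts a new one. Swapping in a model-backed predicate leaves `segment` unchanged.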
## Training Setup
The model was trained for 3 epochs from `bert-base-uncased` on paragraph pairs (limited to 512 subwords with the `longest_first` truncation strategy).
We use a batch size of 24 with 2 steps of gradient accumulation (effective batch size of 48), and a learning rate of 1e-4, with gradient clipping at 5.
Training was performed on a single Titan RTX GPU over 3 weeks.
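
For reference, the hyperparameters above can be collected into keyword arguments whose names follow the `transformers` `TrainingArguments` class and the tokenizer call signature. This is a hedged sketch rather than the authors' training script: the variable names are illustrative, and mapping "gradient clipping at 5" to `max_grad_norm` assumes norm-based clipping, which this card does not specify.

```python
# Keyword names follow transformers.TrainingArguments; values come from the text above.
training_kwargs = {
    "num_train_epochs": 3,
    "per_device_train_batch_size": 24,
    "gradient_accumulation_steps": 2,
    "learning_rate": 1e-4,
    "max_grad_norm": 5.0,  # assumption: clipping by gradient norm
}

# Tokenizer settings for paragraph pairs (names follow the tokenizer __call__ API).
tokenizer_kwargs = {"truncation": "longest_first", "max_length": 512}

# On a single GPU: effective batch size = per-device batch size x accumulation steps.
effective_batch_size = (
    training_kwargs["per_device_train_batch_size"]
    * training_kwargs["gradient_accumulation_steps"]
)
```

The resulting `effective_batch_size` is 48, matching the figure stated above. The dictionaries could be unpacked into `TrainingArguments` (together with an output directory of your choice) and into a tokenizer call, respectively.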