---
language:
  - en
tags:
  - sentence-similarity
  - text-classification
datasets:
  - dennlinger/wiki-paragraphs
metrics:
  - f1
license: mit
---

# BERT-Wiki-Paragraphs

Authors: Satya Almasian\*, Dennis Aumiller\*, Lucienne-Sophie Marmé, Michael Gertz  
Contact us at `<lastname>@informatik.uni-heidelberg.de`  
Details for the training method can be found in our work [Structural Text Segmentation of Legal Documents](https://arxiv.org/abs/2012.03619).
The training procedure follows the same setup, but we substitute Wikipedia articles for the legal documents in this model.
Find the associated training data here: [wiki-paragraphs](https://huggingface.co/datasets/dennlinger/wiki-paragraphs)

Training is performed in a weakly supervised fashion: the model learns to predict whether two paragraphs topically belong together or not.
We utilize automatically generated samples from Wikipedia for training, where paragraphs from within the same section are assumed to be topically coherent.  
We use the same articles as [Koshorek et al., 2018](https://arxiv.org/abs/1803.09337),
albeit from a 2021 dump of Wikipedia, and split at paragraph boundaries instead of the sentence level.
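
To illustrate this sampling scheme, pair generation might look roughly like the sketch below. The section/paragraph data structure and the `make_pairs` helper are hypothetical, not the authors' actual preprocessing code; only the labeling idea (same section = 1, different section = 0) follows the description above.

```python
import random

def make_pairs(sections):
    """Build weakly supervised paragraph pairs from one article's sections.

    `sections` is assumed to be a list of sections, each a list of paragraph
    strings (a hypothetical format, not the authors' actual preprocessing).
    Label 1: consecutive paragraphs from the same section (topically coherent).
    Label 0: paragraphs drawn from two different sections (topic change).
    """
    samples = []
    for idx, section in enumerate(sections):
        # Positive pairs: consecutive paragraphs within the same section.
        for first, second in zip(section, section[1:]):
            samples.append((f"{first} [SEP] {second}", 1))
        # Negative pairs: pair a paragraph with one from a different section.
        other_paragraphs = [
            p for j, other in enumerate(sections) if j != idx for p in other
        ]
        if section and other_paragraphs:
            samples.append(
                (f"{section[-1]} [SEP] {random.choice(other_paragraphs)}", 0)
            )
    return samples
```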

## Usage
Preferred usage is through `transformers.pipeline`:

```python
from transformers import pipeline
pipe = pipeline("text-classification", model="dennlinger/bert-wiki-paragraphs")

pipe("{First paragraph} [SEP] {Second paragraph}")
```
A predicted label of "1" means that the two paragraphs belong to the same topic; a "0" indicates a topic change.
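
To segment a longer document, one option is to score consecutive paragraph pairs and start a new segment wherever the model predicts a topic change. The sketch below reuses the `pipe` object from above; the exact label string ("0" vs. "LABEL_0") depends on the model's label mapping, so check the pipeline output first.

```python
# Toy example: find topic boundaries between consecutive paragraphs.
paragraphs = [
    "The Eiffel Tower is a wrought-iron lattice tower in Paris.",
    "It was constructed from 1887 to 1889 for the World's Fair.",
    "Photosynthesis converts light energy into chemical energy.",
]

boundaries = []
for i, (first, second) in enumerate(zip(paragraphs, paragraphs[1:])):
    prediction = pipe(f"{first} [SEP] {second}")[0]
    if prediction["label"].endswith("0"):  # "0" / "LABEL_0": topic change
        boundaries.append(i + 1)           # new segment starts at paragraph i + 1

print(boundaries)  # expected: [2] for the toy paragraphs above
```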

## Training Setup
The model was trained for 3 epochs from `bert-base-uncased` on paragraph pairs (truncated to 512 subwords with the `longest_first` truncation strategy).
We use a batch size of 24 with 2 steps of gradient accumulation (effective batch size of 48), a learning rate of 1e-4, and gradient clipping at 5.
Training was performed on a single Titan RTX GPU over a period of 3 weeks.
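
For reference, a training run with these hyperparameters could look roughly like the following. This is a minimal sketch, not the authors' original training script; the dataset column names (`sentence1`, `sentence2`, `label`) and the output directory are assumptions that should be checked against the wiki-paragraphs dataset card.

```python
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Column names ("sentence1", "sentence2", "label") are assumptions; check the
# wiki-paragraphs dataset card for the actual field names before running.
dataset = load_dataset("dennlinger/wiki-paragraphs", split="train")
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)

def tokenize(batch):
    # Paragraph pairs are capped at 512 subwords, "longest_first" truncation.
    return tokenizer(batch["sentence1"], batch["sentence2"],
                     truncation="longest_first", max_length=512)

train_data = dataset.map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="bert-wiki-paragraphs",   # hypothetical output path
    num_train_epochs=3,
    per_device_train_batch_size=24,
    gradient_accumulation_steps=2,       # effective batch size of 48
    learning_rate=1e-4,
    max_grad_norm=5.0,                   # gradient clipping at 5
)

Trainer(model=model, args=args, train_dataset=train_data,
        tokenizer=tokenizer).train()
```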