---
language: fr
tags:
- camembert
- long context
pipeline_tag: fill-mask
---

# LSG model 
**Transformers >= 4.36.1**\
**This model relies on a custom modeling file, you need to add trust_remote_code=True**\
**See [\#13467](https://github.com/huggingface/transformers/pull/13467)**

LSG ArXiv [paper](https://arxiv.org/abs/2210.15497). \
Github/conversion script is available at this [link](https://github.com/ccdv-ai/convert_checkpoint_to_lsg).

* [Usage](#usage)
* [Parameters](#parameters)
* [Sparse selection type](#sparse-selection-type)
* [Tasks](#tasks)
* [Training global tokens](#training-global-tokens)

This model is adapted from [CamemBERT-base](https://huggingface.co/camembert-base) without additional pretraining yet. It uses the same number of parameters/layers and the same tokenizer.

This model can handle long sequences faster and more efficiently than Longformer or BigBird (from Transformers), relying on Local + Sparse + Global attention (LSG).

The model expects sequences whose length is a multiple of the block size. It is "adaptive" and automatically pads sequences if needed (adaptive=True in config). It is nevertheless recommended to truncate the inputs with the tokenizer (truncation=True) and optionally to pad to a multiple of the block size (pad_to_multiple_of=...).
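
For illustration, a minimal tokenization sketch (assuming the default block_size of 128; the example text is arbitrary) that truncates to the model maximum and pads to a multiple of the block size:

```python:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("ccdv/lsg-camembert-base-4096")

# Truncate to the model maximum length and (optionally) pad so the final length
# is a multiple of block_size. Padding must be enabled for pad_to_multiple_of to apply.
# If you skip this step, the model pads internally anyway since adaptive=True by default.
inputs = tokenizer(
    "Paris est la capitale de la France. " * 200,
    return_tensors="pt",
    truncation=True,
    padding=True,
    pad_to_multiple_of=128,
)
print(inputs["input_ids"].shape)  # sequence length is a multiple of 128
```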

Encoder-decoder architectures are supported, but they have not been tested extensively.\
Implemented in PyTorch.

![attn](attn.png)

## Usage
The model relies on a custom modeling file, you need to add trust_remote_code=True to use it.

```python: 
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained("ccdv/lsg-camembert-base-4096", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("ccdv/lsg-camembert-base-4096")
``` 

## Parameters
You can change various parameters like:
* the number of global tokens (num_global_tokens=1)
* local block size (block_size=128)
* sparse block size (sparse_block_size=128)
* sparsity factor (sparsity_factor=2)
* mask_first_token (mask first token since it is redundant with the first global token)
* see config.json file

Default parameters work well in practice. If you are short on memory, reduce the block sizes, increase the sparsity factor, and remove dropout from the attention score matrix.

```python:
from transformers import AutoModel

model = AutoModel.from_pretrained("ccdv/lsg-camembert-base-4096", 
    trust_remote_code=True, 
    num_global_tokens=16,
    block_size=64,
    sparse_block_size=64,
    attention_probs_dropout_prob=0.0,
    sparsity_factor=4,
    sparsity_type="none",
    mask_first_token=True
)
``` 

## Sparse selection type

There are 6 different sparse selection patterns. The best type is task dependent; a configuration sketch follows the list below. \
If `sparse_block_size=0` or `sparsity_type="none"`, only local attention is considered. \
Note that for sequences with length < 2*block_size, the type has no effect.
* `sparsity_type="bos_pooling"` (new)
    * weighted average pooling using the BOS token 
    * Works best in general, especially with a rather large sparsity_factor (8, 16, 32)
    * Additional parameters:
        * None
* `sparsity_type="norm"`, select highest norm tokens
    * Works best for a small sparsity_factor (2 to 4)
    * Additional parameters:
        * None
* `sparsity_type="pooling"`, use average pooling to merge tokens
    * Works best for a small sparsity_factor (2 to 4)
    * Additional parameters:
        * None
* `sparsity_type="lsh"`, use the LSH algorithm to cluster similar tokens
    * Works best for a large sparsity_factor (4+)
    * LSH relies on random projections, thus inference may differ slightly with different seeds
    * Additional parameters:
        * lsg_num_pre_rounds=1, pre-merge tokens n times before computing centroids
* `sparsity_type="stride"`, use a striding mechanism per head
    * Each head will use different tokens strided by sparsity_factor
    * Not recommended if sparsity_factor > num_heads
* `sparsity_type="block_stride"`, use a striding mechanism per head
    * Each head will use blocks of tokens strided by sparsity_factor
    * Not recommended if sparsity_factor > num_heads

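As an illustration only (the values below are not tuned recommendations), this sketch selects the `bos_pooling` type with a larger sparsity factor; any of the types above can be passed the same way:

```python:
from transformers import AutoModel

# Illustrative configuration: pick a sparse selection type and a sparsity factor
model = AutoModel.from_pretrained("ccdv/lsg-camembert-base-4096",
    trust_remote_code=True,
    sparsity_type="bos_pooling",  # "bos_pooling", "norm", "pooling", "lsh", "stride", "block_stride" or "none"
    sparsity_factor=8,            # larger factors tend to suit "bos_pooling" and "lsh"
)
```
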
## Tasks
Fill mask example:
```python:
from transformers import FillMaskPipeline, AutoModelForMaskedLM, AutoTokenizer

model = AutoModelForMaskedLM.from_pretrained("ccdv/lsg-camembert-base-4096", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("ccdv/lsg-camembert-base-4096")

SENTENCES = "Paris est la <mask> de la France."
pipeline = FillMaskPipeline(model, tokenizer)
output = pipeline(SENTENCES)

> 'Paris est la capitale de la France.'
```


Classification example:
```python:
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model = AutoModelForSequenceClassification.from_pretrained("ccdv/lsg-camembert-base-4096", 
    trust_remote_code=True, 
    pool_with_global=True, # pool with a global token instead of first token
)
tokenizer = AutoTokenizer.from_pretrained("ccdv/lsg-camembert-base-4096")

SENTENCE = "This is a test for sequence classification. " * 300
token_ids = tokenizer(
    SENTENCE, 
    return_tensors="pt", 
    #pad_to_multiple_of=... # Optional
    truncation=True
    )
output = model(**token_ids)

> SequenceClassifierOutput(loss=None, logits=tensor([[-0.3051, -0.1762]], grad_fn=<AddmmBackward>), hidden_states=None, attentions=None)
```

## Training global tokens
To train global tokens and the classification head only:
```python:
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model = AutoModelForSequenceClassification.from_pretrained("ccdv/lsg-camembert-base-4096", 
    trust_remote_code=True, 
    pool_with_global=True, # pool with a global token instead of first token
    num_global_tokens=16
)
tokenizer = AutoTokenizer.from_pretrained("ccdv/lsg-camembert-base-4096")

for name, param in model.named_parameters():
    # Keep only the global token embeddings and the classification head trainable.
    # The head parameters are assumed to be prefixed with "classifier", as in the base CamemBERT architecture.
    if "global_embeddings" in name or "classifier" in name:
        param.requires_grad = True
    else:
        param.requires_grad = False
```
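
A quick, illustrative sanity check that only the intended parameters will be updated:

```python:
# Illustrative check: count and preview the tensors that remain trainable
trainable = [name for name, param in model.named_parameters() if param.requires_grad]
print(f"{len(trainable)} trainable tensors")
print(trainable[:5])
```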

**CamemBERT**
```
@inproceedings{Martin_2020,
	doi = {10.18653/v1/2020.acl-main.645},
	url = {https://doi.org/10.18653%2Fv1%2F2020.acl-main.645},
	year = 2020,
	publisher = {Association for Computational Linguistics},
	author = {Louis Martin and Benjamin Muller and Pedro Javier Ortiz Su{\'{a}}rez and Yoann Dupont and Laurent Romary and {\'{E}}ric de la Clergerie and Djam{\'{e}} Seddah and Beno{\^{\i}}t Sagot},
	title = {{CamemBERT}: a Tasty French Language Model},
	booktitle = {Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics}
}
```