Migrate model card from transformers-repo
Read announcement at https://discuss.huggingface.co/t/announcement-all-model-cards-will-be-migrated-to-hf-co-model-repos/2755
Original file history: https://github.com/huggingface/transformers/commits/master/model_cards/google/reformer-enwik8/README.md
README.md
ADDED
## Character-level Reformer language model trained on enwik8

*enwik8* is a dataset based on Wikipedia and is often used to measure a model's ability to *compress* data, *e.g.* in
the scope of the *Hutter Prize*: https://en.wikipedia.org/wiki/Hutter_Prize.

`reformer-enwik8` was pretrained on the first 90M chars of *enwik8*, with the text chunked into sequences of 65536 chars (=2^16).
The model's weights were taken from https://console.cloud.google.com/storage/browser/trax-ml/reformer/enwik8 and converted
to Hugging Face's PyTorch ReformerLM model `ReformerModelWithLMHead`.
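
To sanity-check the 65536-char sequence length against the hosted checkpoint, the configuration can be inspected. A minimal sketch, assuming the checkpoint exposes the usual `ReformerConfig` fields (the exact values are worth verifying against the downloaded config):

```python
from transformers import ReformerConfig

config = ReformerConfig.from_pretrained("google/reformer-enwik8")

# Reformer factorizes its position embeddings axially, so the product of
# axial_pos_shape is expected to equal the 65536 = 2**16 training chunk size.
print(config.axial_pos_shape)
print(config.max_position_embeddings)
```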

The model is a language model that operates on characters.
Therefore, this model does not need a tokenizer. The following two functions can instead be used for **encoding** and **decoding**:

```python
import torch

# Encoding
def encode(list_of_strings, pad_token_id=0):
    max_length = max([len(string) for string in list_of_strings])

    # create empty tensors
    attention_masks = torch.zeros((len(list_of_strings), max_length), dtype=torch.long)
    input_ids = torch.full((len(list_of_strings), max_length), pad_token_id, dtype=torch.long)

    for idx, string in enumerate(list_of_strings):
        # make sure string is in byte format
        if not isinstance(string, bytes):
            string = str.encode(string)

        input_ids[idx, :len(string)] = torch.tensor([x + 2 for x in string])
        attention_masks[idx, :len(string)] = 1

    return input_ids, attention_masks

# Decoding
def decode(outputs_ids):
    decoded_outputs = []
    for output_ids in outputs_ids.tolist():
        # transform IDs back to chars; IDs < 2 are simply transformed to ""
        decoded_outputs.append("".join([chr(x - 2) if x > 1 else "" for x in output_ids]))
    return decoded_outputs
```
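
As a quick usage check of the two helpers (a small illustrative example whose behaviour follows directly from the code above): strings are padded with `pad_token_id=0` to the longest entry, byte values are shifted by 2, and IDs below 2 decode back to the empty string.

```python
# round-trip two strings of different lengths through encode/decode
input_ids, attention_masks = encode(["hello", "hi"])

print(input_ids.shape)     # torch.Size([2, 5]) - padded to the longest string
print(attention_masks)     # 1 for real characters, 0 for padding
print(decode(input_ids))   # ['hello', 'hi'] - pad IDs (< 2) decode to ""
```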

Text can be generated as follows:

```python
from transformers import ReformerModelWithLMHead

model = ReformerModelWithLMHead.from_pretrained("google/reformer-enwik8")
encoded, attention_masks = encode(["In 1965, Brooks left IBM to found the Department of"])
decode(model.generate(encoded, do_sample=True, max_length=150))

# gives:
# In 1965, Brooks left IBM to found the Department of Journalism in 1968. IBM had jurisdiction himself in 1980, while Brooks resolved, nevertheless thro
```
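
The attention mask returned by `encode` is not needed above because the prompt is a single, unpadded string. For batched prompts of different lengths, a hedged sketch (assuming the standard `generate` API accepts `attention_mask` for this model) would pass it along so padded positions are ignored:

```python
# batched generation sketch - reuses the encode/decode helpers defined above
prompts = [
    "In 1965, Brooks left IBM to found the Department of",
    "The Hutter Prize is",
]
input_ids, attention_masks = encode(prompts)

generated = model.generate(
    input_ids,
    attention_mask=attention_masks,
    do_sample=True,
    max_length=150,
)
print(decode(generated))
```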

***Note***: Language generation using `ReformerModelWithLMHead` is not optimized yet and is rather slow.