update
compression_app.py CHANGED (+2 -2)
@@ -28,7 +28,7 @@ from compression_util import get_compression_leaderboard, common_corpuses
 docs = """## 📖 What is a good tokenizer?
 
 From a compression perspective, a good tokenizer should be lossless,
-and keep high compression rate (fewer tokens for given text).
+and keep high compression rate (fewer tokens for given text). <br>
 The encoding and decoding process can be formulated as
 ```python
 token_ids = tokenizer.encode(input_text)   # compressed tokens
@@ -40,7 +40,7 @@ Lossless tokenization preserves the exact original text, i.e. `decoded_text = in
 
 - Most lossy tokenizers get many out-of-vocabulary tokens. 👉 Check the [oov of bert-base-uncased](https://huggingface.co/spaces/eson/tokenizer-arena/blob/main/stats/compression_rate/google-bert.bert-base-cased%20%40%20cc100.zh-Hans.diff.json).
 - Some other tokenizers have no oov, but still be lossy due to text normalization. For example qwen performs [unicode normalization](https://github.com/huggingface/transformers/blob/v4.42.3/src/transformers/models/qwen2/tokenization_qwen2.py#L338),
-which may bring some [slight
+which may bring some [slight differences](https://huggingface.co/spaces/eson/tokenizer-arena/raw/main/stats/compression_rate/Qwen.Qwen1.5-1.8B%20@%20cc100.ja.diff.json) to the reconstructed text.
 
 - **Compression Rate** <br>
 There are mainly two types of metric to represent the `input_text`: