xu-song committed on
Commit e9d62ec
1 Parent(s): 0bba168

update doc

Files changed (1)
  1. compression_app.py +6 -4
compression_app.py CHANGED
@@ -38,16 +38,18 @@ The encoding and decoding process can be formulated as
  - **Lossless** <br>
  Lossless tokenization preserves the exact original text, i.e. `decoded_text = input_text`.
 
- - Most lossy tokenizers get many out-of-vocabulary tokens. 👉 Check the
- oov of [bert](https://huggingface.co/spaces/eson/tokenizer-arena/blob/main/stats/compression_rate/google-bert.bert-base-cased%20%40%20cc100.zh-Hans.diff.json) and
+ - Most lossy tokenizers get many out-of-vocabulary(OOV) words. 👉 Check the
+ OOV of [bert](https://huggingface.co/spaces/eson/tokenizer-arena/blob/main/stats/compression_rate/google-bert.bert-base-cased%20%40%20cc100.zh-Hans.diff.json) and
  [t5](https://huggingface.co/spaces/eson/tokenizer-arena/blob/main/stats/compression_rate/google-t5.t5-large%20%40%20cc100.es.diff.json).
- - Some other tokenizers have no oov, but still be lossy due to text normalization. For example, qwen performs [unicode normalization](https://github.com/huggingface/transformers/blob/v4.42.3/src/transformers/models/qwen2/tokenization_qwen2.py#L338) in encoding process,
+ - Even if a tokenizer has no OOV, it can be lossy due to text normalization. For example, qwen performs [unicode normalization](https://github.com/huggingface/transformers/blob/v4.42.3/src/transformers/models/qwen2/tokenization_qwen2.py#L338) in encoding process,
  llama performs [clean_up_tokenization_spaces](https://huggingface.co/meta-llama/Meta-Llama-3.1-8B/blob/main/tokenizer_config.json#L2053) in decoding process,
  which may bring some slight differences to the reconstructed text. 👉 Check the diff of
  [qwen](https://huggingface.co/spaces/eson/tokenizer-arena/raw/main/stats/compression_rate/Qwen.Qwen1.5-1.8B%20@%20cc100.ja.diff.json) and
  [llama](https://huggingface.co/spaces/eson/tokenizer-arena/raw/main/stats/compression_rate/meta-llama.Meta-Llama-3.1-405B%20@%20cc100.en.diff.json).
 
 
+
+
  - **Compression Rate** <br>
  There are mainly two types of metric to represent the `input_text`:
  - `char-level`: the number of characters in the given text
@@ -144,7 +146,7 @@ with gr.Blocks(theme=theme) as demo:
  # "- `g_bytes/b_tokens` measures how many gigabytes corpus per billion tokens.\n"
  # "- `t_bytes/t_tokens` measures how many terabytes corpus per trillion tokens.\n"
  " - `char/token` measures how many chars per token on the tokenized corpus.\n"
- " - `oov_ratio`: out-of-vocabulary ratio on the selected corpus, 👉 get [oov charset](https://huggingface.co/spaces/eson/tokenizer-arena/raw/main/stats/compression_rate.json)\n\n"
+ " - `oov_ratio`: out-of-vocabulary ratio on the selected corpus, 👉 get [OOV charset](https://huggingface.co/spaces/eson/tokenizer-arena/raw/main/stats/compression_rate.json)\n\n"
  "You can reproduce this procedure with [compression_util.py](https://huggingface.co/spaces/eson/tokenizer-arena/blob/main/compression_util.py)."
  )
 
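
The doc text in this commit distinguishes two ways a tokenizer can fail the round trip `decoded_text = input_text`: out-of-vocabulary (OOV) characters, and text normalization with no OOV at all. The sketch below illustrates both with a toy character-level tokenizer. It is not the code in `compression_util.py`; `ToyCharTokenizer`, `round_trip_stats`, the `[UNK]` placeholder, and the per-token definition of `oov_ratio` are all assumptions made for illustration.

```python
import unicodedata

UNK = "[UNK]"  # hypothetical placeholder emitted for out-of-vocabulary characters


class ToyCharTokenizer:
    """Hypothetical character-level tokenizer with a fixed vocabulary."""

    def __init__(self, vocab, normalize=False):
        self.vocab = set(vocab)
        self.normalize = normalize  # if True, apply NFC normalization before encoding

    def encode(self, text):
        if self.normalize:
            text = unicodedata.normalize("NFC", text)
        # Characters outside the vocabulary collapse to UNK, so they cannot be recovered.
        return [ch if ch in self.vocab else UNK for ch in text]

    def decode(self, tokens):
        return "".join(tokens)


def round_trip_stats(tokenizer, text):
    """Encode then decode `text`, and report round-trip statistics."""
    tokens = tokenizer.encode(text)
    if not tokens:
        raise ValueError("text produced no tokens")
    decoded = tokenizer.decode(tokens)
    n_oov = sum(tok == UNK for tok in tokens)
    return {
        "lossless": decoded == text,              # decoded_text == input_text
        "char/token": len(text) / len(tokens),    # chars per token
        "oov_ratio": n_oov / len(tokens),         # assumed definition: share of UNK tokens
    }


if __name__ == "__main__":
    # Case 1: lossy because of an OOV character (U+00F6 is not in the vocabulary).
    ascii_tok = ToyCharTokenizer("abcdefghijklmnopqrstuvwxyz ")
    print(round_trip_stats(ascii_tok, "hello w\u00f6rld"))

    # Case 2: no OOV, but still lossy because NFC rewrites "e" + U+0301 as U+00E9.
    nfc_tok = ToyCharTokenizer("caf\u00e9", normalize=True)
    print(round_trip_stats(nfc_tok, "cafe\u0301"))
```

In the second case every character is in the vocabulary after normalization, so `oov_ratio` is zero, yet the decoded string differs from the input at the code-point level. This is the same kind of mismatch the diff files linked above record for real tokenizers.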