"""

TODO:

- collect statistics on tokenizer_impl

- collect statistics on OOV

- collect statistics on reversal

- add math and code corpora







## balance



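- High compression rate vs. vocab_size:

    - A high compression rate means fewer tokens after encoding, so each token is longer --> vocab_size becomes too large

- High compression rate vs. lossless

    - s

- OOV

    - With many OOVs, many UNKs may be generated (one UNK per char) --> more tokens -> lower compression rate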

    - With many OOVs, few UNKs may be generated () --> fewer tokens -> higher compression rate



"""

import gradio as gr
from compression_util import get_compression_leaderboard, common_corpuses


# From the perspective of compression
# exactly reconstructed from compressed tokens
docs = """## 📖 What is a good tokenizer?

From a compression perspective, a good tokenizer should be lossless
and keep a high compression rate (fewer tokens for a given text). <br>
The encoding and decoding process can be formulated as
```python
    token_ids = tokenizer.encode(input_text)    # compressed tokens
    decoded_text = tokenizer.decode(token_ids)  # reconstructed text
```



**Lossless**<br>
Lossless tokenization preserves the exact original text, i.e. `decoded_text == input_text`. There are two main causes of compression loss.

1. `OOV`: Most lossy tokenizers produce many out-of-vocabulary (OOV) words. 👉 Check the OOV and
tokenization loss of [bert](https://huggingface.co/spaces/eson/tokenizer-arena/blob/main/stats/compression_rate/google-bert.bert-base-cased%20%40%20cc100.zh-Hans.diff.json) and
[t5](https://huggingface.co/spaces/eson/tokenizer-arena/blob/main/stats/compression_rate/google-t5.t5-large%20%40%20cc100.es.diff.json).
2. `Normalization`: Even if a tokenizer has no OOV, it can be lossy due to text normalization. For example, qwen performs [unicode normalization](https://github.com/huggingface/transformers/blob/v4.42.3/src/transformers/models/qwen2/tokenization_qwen2.py#L338) in the encoding process, and
llama performs [clean_up_tokenization_spaces](https://huggingface.co/meta-llama/Meta-Llama-3.1-8B/blob/main/tokenizer_config.json#L2053) in the decoding process,
which may introduce slight differences in the reconstructed text. 👉 Check the tokenization loss of
[qwen](https://huggingface.co/spaces/eson/tokenizer-arena/raw/main/stats/compression_rate/Qwen.Qwen1.5-1.8B%20@%20cc100.ja.diff.json) and
[llama](https://huggingface.co/spaces/eson/tokenizer-arena/raw/main/stats/compression_rate/meta-llama.Meta-Llama-3.1-405B%20@%20cc100.en.diff.json).
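
A minimal sketch of a round-trip check may help make the OOV case concrete. The toy whitespace tokenizer below, its `VOCAB`, and the helpers `is_lossless` and `oov_ratio` are hypothetical names introduced for illustration only; the leaderboard may compute its `oov_ratio` differently.

```python
# Toy whitespace tokenizer with a fixed vocabulary (hypothetical, for illustration only).
VOCAB = {"<unk>": 0, "hello": 1, "world": 2}
ID_TO_TOKEN = {i: t for t, i in VOCAB.items()}


def encode(text):
    # Words missing from VOCAB collapse to the <unk> id, which discards information.
    return [VOCAB.get(word, VOCAB["<unk>"]) for word in text.split(" ")]


def decode(token_ids):
    return " ".join(ID_TO_TOKEN[i] for i in token_ids)


def is_lossless(text):
    # Lossless means the round trip reproduces the input exactly.
    return decode(encode(text)) == text


def oov_ratio(text):
    # Share of tokens that fell back to <unk> (an assumed definition).
    token_ids = encode(text)
    return sum(i == VOCAB["<unk>"] for i in token_ids) / len(token_ids)
```

With this toy vocabulary, `is_lossless("hello world")` holds, while `"hello tokenizer"` decodes to `"hello <unk>"` and is therefore lossy.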







**Compression Rate**<br>
There are two main ways to measure the size of `input_text`:
  - `char-level`: the number of characters in the given text
  - `byte-level`: the number of bytes in the given text.

To evaluate the compression rate, simple metrics are "how many chars per token" or "how many bytes per token". <br>
In this leaderboard, we adopt the more frequently used metrics: "how many chars per token" and
"how many billion tokens per gigabyte of corpus", i.e. `char/token` and `b_tokens/g_bytes`.

💬 [Discussion is Welcome](https://huggingface.co/spaces/eson/tokenizer-arena/discussions)
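
As a rough sketch, the compression metrics above could be computed as follows. The function name `compression_metrics`, the UTF-8 byte count, and the decimal units (1 gigabyte = 1e9 bytes) are assumptions made for illustration and may differ from what `compression_util.py` actually does.

```python
def compression_metrics(text, n_tokens):
    # n_tokens is the length of the encoded sequence, e.g. len(tokenizer.encode(text)).
    n_chars = len(text)                  # char-level size
    n_bytes = len(text.encode("utf-8"))  # byte-level size, assuming UTF-8 storage
    return {
        "char/token": n_chars / n_tokens,
        "byte/token": n_bytes / n_tokens,
        # (tokens / 1e9) / (bytes / 1e9) reduces to tokens per byte,
        # assuming decimal billions and decimal gigabytes.
        "b_tokens/g_bytes": n_tokens / n_bytes,
    }
```

For instance, a text of 8 characters stored in 12 UTF-8 bytes and encoded into 4 tokens gives a `char/token` of 2.0 and a `byte/token` of 3.0. A higher `char/token` and a lower `b_tokens/g_bytes` both indicate stronger compression.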

"""



# theme = gr.themes.Monochrome()
theme = gr.themes.Default()
# theme.set(accordion_text_weight=600)  # not supported yet
with gr.Blocks(theme=theme) as demo:
    # gr.Markdown("## Convertor")
    # with gr.Accordion("Convertor", open=False):
    #     gr.Markdown("Tokenize {} corpus")
    #     with gr.Row(elem_classes="no-border"):
    #         gr.Button("File Size", min_width=50)
    #         file_size = gr.Textbox(
    #             show_label=False,
    #             min_width=50,
    #             # elem_classes="textbox-as-text"
    #         )
    #         gr.Dropdown(
    #             choices=['MB', 'GB', 'TB'],
    #             show_label=False,
    #             min_width=15,
    #             # elem_classes="textbox-as-text"
    #         )
    #         # gr.Markdown('<h2 align="center">≈</h2>')
    #         # gr.HTML('<h2 style="margin: auto;">≈</h2>')
    #         gr.Button(
    #             "≈",
    #             min_width=10,
    #             elem_classes="button-white h2-font"
    #
    #         )
    #
    #         gr.Button(
    #             "Tokens",
    #             min_width=50
    #         )
    #         gr.Textbox(
    #             show_label=False,
    #             min_width=50
    #         )
    #         gr.Dropdown(
    #             ['million', 'billion', 'trillion'],
    #             show_label=False,
    #             min_width=15,
    #             elem_classes="button-white"
    #         )



    gr.Markdown(docs)
    gr.Markdown("## 🛠️ Setting")  # ⚙
    gr.Markdown("We tokenize several corpora and calculate the compression rate on each.")
    with gr.Accordion("Please select the corpus and measure of compression rate.", open=True):
        # file size 💽 🖴, tokens 🧮
        # Total amount of disk used
        with gr.Row():
            with gr.Column():
                compress_rate_corpus = gr.Dropdown(
                    common_corpuses,  # , "code"
                    value=["cc100/en", "cc100/zh-Hans", "cc100/fr", "cc100/es"],
                    label="corpus",
                    multiselect=True
                    # info=""
                )

                # unit of file_size: gigabyte terabyte
                # unit of token_num: million billion trillion
                # The most common units of measurement include length (meter, inch, foot), weight (gram, kilogram, pound), volume (liter, gallon, milliliter), time (second, minute, hour)
                compress_rate_unit = gr.Radio(
                    ["b_tokens/g_bytes", "t_tokens/t_bytes"],
                    value="b_tokens/g_bytes",
                    label="measure",  # evaluation metric
                )

            gr.Markdown(
                # "Note:\n\n  explanation"
                # "Supported languages are (20): arabic (ar), bulgarian (bg), german (de), modern greek (el), english (en), spanish (es), french (fr), hindi (hi), italian (it), japanese (ja), dutch (nl), polish (pl), portuguese (pt), russian (ru), swahili (sw), thai (th), turkish (tr), urdu (ur), vietnamese (vi), and chinese (zh)."
                # " arabic (ar), english (en), spanish (es), french (fr), italian (it), japanese (ja), portuguese (pt), russian (ru), and chinese (zh)."
                "- `corpus`: tokenization is performed on the selected subsets of the [cc100](https://huggingface.co/datasets/statmt/cc100) corpus.\n"
                "- measure\n"
                "  - `b_tokens/g_bytes` measures how many billion tokens per gigabyte of corpus.\n"
                "  - `t_tokens/t_bytes` measures how many trillion tokens per terabyte of corpus.\n"
                # "- `g_bytes/b_tokens` measures how many gigabytes corpus per billion tokens.\n"
                # "- `t_bytes/t_tokens` measures how many terabytes corpus per trillion tokens.\n"
                "  - `char/token` measures how many chars per token on the tokenized corpus.\n"
                "  - `oov_ratio`: out-of-vocabulary ratio on the selected corpus, 👉 check [OOV charset](https://huggingface.co/spaces/eson/tokenizer-arena/raw/main/stats/compression_rate.json)\n\n"
                "You can reproduce this procedure with [compression_util.py](https://huggingface.co/spaces/eson/tokenizer-arena/blob/main/compression_util.py)."
            )

    gr.Markdown("## 🏆 Compression Rate Leaderboard\n"
                "This leaderboard aims to evaluate tokenizer performance on different languages.\n"
                "A lower `oov_ratio` means fewer out-of-vocabulary tokens.\n"
                "A lower `char/token` means more words are likely segmented into subwords."
                )
    search_bar = gr.Textbox(
        placeholder="🔍 Search by tokenizer or organization (e.g., 'llama', 'openai') and press ENTER...",
        show_label=False,
        elem_id="search-bar",
    )
    compress_rate_table = gr.Dataframe(datatype="html")

    # func call
    compress_rate_corpus.change(
        get_compression_leaderboard,
        inputs=[compress_rate_corpus, compress_rate_unit, search_bar],
        outputs=compress_rate_table
    )
    compress_rate_unit.change(
        get_compression_leaderboard,
        inputs=[compress_rate_corpus, compress_rate_unit, search_bar],
        outputs=compress_rate_table,
        show_api=False
    )
    # file_size.change(
    #     get_all_compress_rate,
    #     outputs=compress_rate_table
    # )

    search_bar.submit(
        get_compression_leaderboard,
        inputs=[
            compress_rate_corpus,
            compress_rate_unit,
            search_bar,
        ],
        outputs=compress_rate_table,
        show_api=False
    )

    demo.load(
        get_compression_leaderboard,
        inputs=[compress_rate_corpus, compress_rate_unit],
        outputs=compress_rate_table,
        show_api=False
    )

if __name__ == "__main__":
    demo.launch()