Commit f331792 by xu-song (1 parent: 9558ae0)
README.md CHANGED
@@ -4,7 +4,7 @@ emoji: ⚡
 colorFrom: red
 colorTo: gray
 sdk: gradio
-sdk_version: 3.41.2
+sdk_version: 4.28.3
 app_file: app.py
 pinned: false
 ---
app.py CHANGED
@@ -7,7 +7,7 @@ from patcher.gr_interface import TabbedInterface
 
 demo = TabbedInterface(
     [tab_playground, tab_compression],
-    [" ⚔️Playground", "🏆 Compression Leaderboard",],  # encoding speed, decoding speed, character categories (zh, num, etc., with regex support), supported languages, organization
+    [" ⚔️ Playground", "🏆 Compression Leaderboard",],  # encoding speed, decoding speed, character categories (zh, num, etc., with regex support), supported languages, organization
     title='<div align="center">Tokenizer Arena ⚔️</div>',
     css="css/style.css"
 )
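The only change in this hunk adds the missing space after the ⚔️ emoji in the first tab label. The custom `TabbedInterface` imported from `patcher.gr_interface` is called with the same shape as Gradio's built-in `gr.TabbedInterface`; the sketch below is a standalone illustration of that wiring, not the Space's code — the two stand-in interfaces are placeholders for the real `tab_playground` and `tab_compression` Blocks.

```python
# Hedged sketch: how two tab demos plug into a tabbed layout with Gradio's
# built-in gr.TabbedInterface (the Space uses its own patched class instead).
import gradio as gr

# Placeholders for the Space's tab_playground / tab_compression Blocks.
playground = gr.Interface(lambda text: text, "text", "text")
leaderboard = gr.Interface(lambda text: text, "text", "text")

demo = gr.TabbedInterface(
    [playground, leaderboard],
    [" ⚔️ Playground", "🏆 Compression Leaderboard"],
    title='<div align="center">Tokenizer Arena ⚔️</div>',
    css="css/style.css",  # same stylesheet path referenced in app.py
)

if __name__ == "__main__":
    demo.launch()
```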
app_compression.py CHANGED
@@ -79,9 +79,9 @@ with gr.Blocks() as demo:
         # "`num_of_trillion_tokens` `num_of_billion_tokens`\n"
         "- `b_tokens/g_bytes` measures how many billion tokens per gigabytes corpus. \n"
         "- `t_tokens/t_bytes` measures how many trillion tokens per terabytes corpus. \n"
-        "- `n_chars/n_tokens` measures how many chars per token in the tokenized corpus. \n\n"
-        "All the above measures are depend on corpus. You can reproduce this "
-        "procedure at [github](https://github.com/xu-song/tokenizer-arena/)."
+        "- `n_chars/n_tokens` measures how many chars per token in the tokenized corpus. \n"
+        # "\nAll the above measures are depend on corpus. You can reproduce this "
+        # "procedure at [github](https://github.com/xu-song/tokenizer-arena/)."
     )
 
     gr.Markdown("## 🏆 Compression Rate Leaderboard")
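The strings above describe the three leaderboard columns. Since a gigabyte is 10⁹ bytes and a billion is 10⁹, `b_tokens/g_bytes` (and likewise `t_tokens/t_bytes`) reduces to plain tokens-per-byte, while `n_chars/n_tokens` is characters per token. A minimal sketch of deriving these columns from the per-tokenizer records added to `stats/compress_rate.json` below (an illustration under the field names visible in the diff, not the Space's actual loader):

```python
# Hedged sketch: compute the leaderboard metrics from stats/compress_rate.json,
# whose records carry vocab_size, n_bytes, n_tokens and n_chars per
# "<tokenizer>.<corpus>" key (e.g. "llama3.cc100-ar").
import json

with open("stats/compress_rate.json") as f:
    stats = json.load(f)

rows = []
for name, s in stats.items():
    rows.append({
        "tokenizer.corpus": name,
        "vocab_size": s["vocab_size"],
        # billion tokens per gigabyte == trillion tokens per terabyte == tokens per byte
        "b_tokens/g_bytes": round(s["n_tokens"] / s["n_bytes"], 4),
        "t_tokens/t_bytes": round(s["n_tokens"] / s["n_bytes"], 4),
        "n_chars/n_tokens": round(s["n_chars"] / s["n_tokens"], 4),
    })

# Fewer tokens per character means better compression on that corpus.
rows.sort(key=lambda r: r["n_chars/n_tokens"], reverse=True)
print(rows[:5])
```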
app_playground.py CHANGED
@@ -58,6 +58,7 @@ with gr.Blocks() as demo:
     gr.Markdown("## Input Text")
     dropdown_examples = gr.Dropdown(
         example_types,
+        value="Examples",
         type="index",
         show_label=False,
         container=False,
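This hunk only gives the dropdown a default `value`, presumably so it renders with a visible label rather than starting empty after the Gradio 4 upgrade. A small self-contained sketch of the same pattern (the `example_types` list and the echo handler are made up for illustration), including the effect of `type="index"` on the callback argument:

```python
# Hedged sketch: a Dropdown with a default value and type="index", so the
# change callback receives the position of the chosen entry rather than its text.
import gradio as gr

example_types = ["Examples", "number", "zh-Hans", "emoji"]  # hypothetical choices

with gr.Blocks() as demo:
    dropdown_examples = gr.Dropdown(
        example_types,
        value="Examples",   # default selection, as added by this commit
        type="index",       # the handler gets an int index, not the string
        show_label=False,
        container=False,
    )
    picked = gr.Textbox(label="selected index")
    dropdown_examples.change(lambda i: str(i), dropdown_examples, picked)

if __name__ == "__main__":
    demo.launch()
```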
stats/compress_rate.json CHANGED
@@ -1864,5 +1864,2423 @@
     "n_bytes": 1540504,
     "n_tokens": 476666,
     "n_chars": 1484970
-  }
+  },
+  "gpt_neox_japanese_2_7b.cc100-en": {
+    "vocab_size": 32000,
+    "n_bytes": 1124813,
+    "n_tokens": 1121413,
+    "n_chars": 1121360
+  },
1874
+ "gpt_neox_japanese_2_7b.cc100-zh-Hans": {
1875
+ "vocab_size": 32000,
1876
+ "n_bytes": 2633047,
1877
+ "n_tokens": 1049033,
1878
+ "n_chars": 927311
1879
+ },
1880
+ "aya_101.cc100-ja": {
1881
+ "vocab_size": 250100,
1882
+ "n_bytes": 1774770,
1883
+ "n_tokens": 300542,
1884
+ "n_chars": 603065
1885
+ },
1886
+ "baichuan.cc100-ja": {
1887
+ "vocab_size": 64000,
1888
+ "n_bytes": 1774770,
1889
+ "n_tokens": 591656,
1890
+ "n_chars": 603065
1891
+ },
1892
+ "baichuan2.cc100-ja": {
1893
+ "vocab_size": 125696,
1894
+ "n_bytes": 1774770,
1895
+ "n_tokens": 554936,
1896
+ "n_chars": 603065
1897
+ },
1898
+ "bert_base_cased.cc100-ja": {
1899
+ "vocab_size": 28996,
1900
+ "n_bytes": 1774770,
1901
+ "n_tokens": 410492,
1902
+ "n_chars": 603065
1903
+ },
1904
+ "bert_base_chinese.cc100-ja": {
1905
+ "vocab_size": 21128,
1906
+ "n_bytes": 1774770,
1907
+ "n_tokens": 396831,
1908
+ "n_chars": 603065
1909
+ },
1910
+ "bert_base_uncased.cc100-ja": {
1911
+ "vocab_size": 30522,
1912
+ "n_bytes": 1774770,
1913
+ "n_tokens": 580634,
1914
+ "n_chars": 603065
1915
+ },
1916
+ "bloom.cc100-ja": {
1917
+ "vocab_size": 250680,
1918
+ "n_bytes": 1774770,
1919
+ "n_tokens": 523592,
1920
+ "n_chars": 603065
1921
+ },
1922
+ "byt5_small.cc100-ja": {
1923
+ "vocab_size": 384,
1924
+ "n_bytes": 1774770,
1925
+ "n_tokens": 1784770,
1926
+ "n_chars": 603065
1927
+ },
1928
+ "aya_101.cc100-ar": {
1929
+ "vocab_size": 250100,
1930
+ "n_bytes": 2813283,
1931
+ "n_tokens": 631736,
1932
+ "n_chars": 1560987
1933
+ },
1934
+ "baichuan.cc100-ar": {
1935
+ "vocab_size": 64000,
1936
+ "n_bytes": 2813283,
1937
+ "n_tokens": 1422976,
1938
+ "n_chars": 1560987
1939
+ },
1940
+ "baichuan2.cc100-ar": {
1941
+ "vocab_size": 125696,
1942
+ "n_bytes": 2813283,
1943
+ "n_tokens": 1337285,
1944
+ "n_chars": 1560987
1945
+ },
1946
+ "bert_base_cased.cc100-ar": {
1947
+ "vocab_size": 28996,
1948
+ "n_bytes": 2813283,
1949
+ "n_tokens": 1232449,
1950
+ "n_chars": 1560987
1951
+ },
1952
+ "bert_base_chinese.cc100-ar": {
1953
+ "vocab_size": 21128,
1954
+ "n_bytes": 2813283,
1955
+ "n_tokens": 536389,
1956
+ "n_chars": 1560987
1957
+ },
1958
+ "bert_base_uncased.cc100-ar": {
1959
+ "vocab_size": 30522,
1960
+ "n_bytes": 2813283,
1961
+ "n_tokens": 1269370,
1962
+ "n_chars": 1560987
1963
+ },
1964
+ "bloom.cc100-ar": {
1965
+ "vocab_size": 250680,
1966
+ "n_bytes": 2813283,
1967
+ "n_tokens": 427489,
1968
+ "n_chars": 1560987
1969
+ },
1970
+ "byt5_small.cc100-ar": {
1971
+ "vocab_size": 384,
1972
+ "n_bytes": 2813283,
1973
+ "n_tokens": 2823283,
1974
+ "n_chars": 1560987
1975
+ },
1976
+ "character_glm_6b.cc100-ar": {
1977
+ "vocab_size": 64789,
1978
+ "n_bytes": 2813283,
1979
+ "n_tokens": 1441847,
1980
+ "n_chars": 1560987
1981
+ },
1982
+ "chatglm2_6b.cc100-ar": {
1983
+ "vocab_size": 64787,
1984
+ "n_bytes": 2813283,
1985
+ "n_tokens": 1441847,
1986
+ "n_chars": 1560987
1987
+ },
1988
+ "chatglm3_6b.cc100-ar": {
1989
+ "vocab_size": 64796,
1990
+ "n_bytes": 2813283,
1991
+ "n_tokens": 1441847,
1992
+ "n_chars": 1560987
1993
+ },
1994
+ "chatglm_6b.cc100-ar": {
1995
+ "vocab_size": 150344,
1996
+ "n_bytes": 2813283,
1997
+ "n_tokens": 1097200,
1998
+ "n_chars": 1560987
1999
+ },
2000
+ "chatyuan_large_v2.cc100-ar": {
2001
+ "vocab_size": 32128,
2002
+ "n_bytes": 2813283,
2003
+ "n_tokens": 1006313,
2004
+ "n_chars": 1560987
2005
+ },
2006
+ "chinese_llama.cc100-ar": {
2007
+ "vocab_size": 49953,
2008
+ "n_bytes": 2813283,
2009
+ "n_tokens": 1421625,
2010
+ "n_chars": 1560987
2011
+ },
2012
+ "chinese_llama2.cc100-ar": {
2013
+ "vocab_size": 55296,
2014
+ "n_bytes": 2813283,
2015
+ "n_tokens": 1432081,
2016
+ "n_chars": 1560987
2017
+ },
2018
+ "code_davinci_002.cc100-ar": {
2019
+ "vocab_size": 50281,
2020
+ "n_bytes": 2813283,
2021
+ "n_tokens": 1558111,
2022
+ "n_chars": 1560987
2023
+ },
2024
+ "crystal_coder.cc100-ar": {
2025
+ "vocab_size": 32022,
2026
+ "n_bytes": 2813283,
2027
+ "n_tokens": 1422081,
2028
+ "n_chars": 1560987
2029
+ },
2030
+ "dbrx_instruct.cc100-ar": {
2031
+ "vocab_size": 100280,
2032
+ "n_bytes": 2813283,
2033
+ "n_tokens": 1105640,
2034
+ "n_chars": 1560987
2035
+ },
2036
+ "deepseek_coder_33b_instruct.cc100-ar": {
2037
+ "vocab_size": 32022,
2038
+ "n_bytes": 2813283,
2039
+ "n_tokens": 1958863,
2040
+ "n_chars": 1560987
2041
+ },
2042
+ "deepseek_llm_7b_base.cc100-ar": {
2043
+ "vocab_size": 100015,
2044
+ "n_bytes": 2813283,
2045
+ "n_tokens": 1426103,
2046
+ "n_chars": 1560987
2047
+ },
2048
+ "falcon_180b.cc100-ar": {
2049
+ "vocab_size": 65024,
2050
+ "n_bytes": 2813283,
2051
+ "n_tokens": 1597443,
2052
+ "n_chars": 1560987
2053
+ },
2054
+ "falcon_7b.cc100-ar": {
2055
+ "vocab_size": 65024,
2056
+ "n_bytes": 2813283,
2057
+ "n_tokens": 1597443,
2058
+ "n_chars": 1560987
2059
+ },
2060
+ "fastchat_t5_3b.cc100-ar": {
2061
+ "vocab_size": 32110,
2062
+ "n_bytes": 2813283,
2063
+ "n_tokens": 832267,
2064
+ "n_chars": 1560987
2065
+ },
2066
+ "flan_t5_base.cc100-ar": {
2067
+ "vocab_size": 32100,
2068
+ "n_bytes": 2813283,
2069
+ "n_tokens": 568957,
2070
+ "n_chars": 1560987
2071
+ },
2072
+ "gemma_7b.cc100-ar": {
2073
+ "vocab_size": 256000,
2074
+ "n_bytes": 2813283,
2075
+ "n_tokens": 573788,
2076
+ "n_chars": 1560987
2077
+ },
2078
+ "gpt2.cc100-ar": {
2079
+ "vocab_size": 50257,
2080
+ "n_bytes": 2813283,
2081
+ "n_tokens": 1558111,
2082
+ "n_chars": 1560987
2083
+ },
2084
+ "gpt2_chinese.cc100-ar": {
2085
+ "vocab_size": 21128,
2086
+ "n_bytes": 2813283,
2087
+ "n_tokens": 617677,
2088
+ "n_chars": 1560987
2089
+ },
2090
+ "gpt_35_turbo.cc100-ar": {
2091
+ "vocab_size": 100277,
2092
+ "n_bytes": 2813283,
2093
+ "n_tokens": 1105640,
2094
+ "n_chars": 1560987
2095
+ },
2096
+ "gpt_4.cc100-ar": {
2097
+ "vocab_size": 100277,
2098
+ "n_bytes": 2813283,
2099
+ "n_tokens": 1105640,
2100
+ "n_chars": 1560987
2101
+ },
2102
+ "gpt_neox_japanese_2_7b.cc100-ar": {
2103
+ "vocab_size": 32000,
2104
+ "n_bytes": 2813283,
2105
+ "n_tokens": 2809195,
2106
+ "n_chars": 1560987
2107
+ },
2108
+ "gpt_nexo_20b.cc100-ar": {
2109
+ "vocab_size": 50277,
2110
+ "n_bytes": 2813283,
2111
+ "n_tokens": 1106277,
2112
+ "n_chars": 1560987
2113
+ },
2114
+ "grok_1.cc100-ar": {
2115
+ "vocab_size": 131072,
2116
+ "n_bytes": 2813283,
2117
+ "n_tokens": 1392088,
2118
+ "n_chars": 1560987
2119
+ },
2120
+ "internlm2_chat_7b.cc100-ar": {
2121
+ "vocab_size": 92544,
2122
+ "n_bytes": 2813283,
2123
+ "n_tokens": 1635378,
2124
+ "n_chars": 1560987
2125
+ },
2126
+ "internlm2_math_7b.cc100-ar": {
2127
+ "vocab_size": 92544,
2128
+ "n_bytes": 2813283,
2129
+ "n_tokens": 1635378,
2130
+ "n_chars": 1560987
2131
+ },
2132
+ "internlm_chat_7b.cc100-ar": {
2133
+ "vocab_size": 103168,
2134
+ "n_bytes": 2813283,
2135
+ "n_tokens": 532046,
2136
+ "n_chars": 1560987
2137
+ },
2138
+ "internlm_xcomposer_7b.cc100-ar": {
2139
+ "vocab_size": 103168,
2140
+ "n_bytes": 2813283,
2141
+ "n_tokens": 532046,
2142
+ "n_chars": 1560987
2143
+ },
2144
+ "jamba_v0_1.cc100-ar": {
2145
+ "vocab_size": 65536,
2146
+ "n_bytes": 2813283,
2147
+ "n_tokens": 727886,
2148
+ "n_chars": 1560987
2149
+ },
2150
+ "kplug.cc100-ar": {
2151
+ "vocab_size": 10261,
2152
+ "n_bytes": 2813283,
2153
+ "n_tokens": 331987,
2154
+ "n_chars": 1560987
2155
+ },
2156
+ "llama.cc100-ar": {
2157
+ "vocab_size": 32000,
2158
+ "n_bytes": 2813283,
2159
+ "n_tokens": 1432081,
2160
+ "n_chars": 1560987
2161
+ },
2162
+ "llama2.cc100-ar": {
2163
+ "vocab_size": 32001,
2164
+ "n_bytes": 2813283,
2165
+ "n_tokens": 1432081,
2166
+ "n_chars": 1560987
2167
+ },
2168
+ "llama3.cc100-ar": {
2169
+ "vocab_size": 128256,
2170
+ "n_bytes": 2813283,
2171
+ "n_tokens": 615514,
2172
+ "n_chars": 1560987
2173
+ },
2174
+ "mistral_7b.cc100-ar": {
2175
+ "vocab_size": 32000,
2176
+ "n_bytes": 2813283,
2177
+ "n_tokens": 1406319,
2178
+ "n_chars": 1560987
2179
+ },
2180
+ "mixtral_8_7b.cc100-ar": {
2181
+ "vocab_size": 32000,
2182
+ "n_bytes": 2813283,
2183
+ "n_tokens": 1406319,
2184
+ "n_chars": 1560987
2185
+ },
2186
+ "mobilebert_uncased.cc100-ar": {
2187
+ "vocab_size": 30522,
2188
+ "n_bytes": 2813283,
2189
+ "n_tokens": 1269370,
2190
+ "n_chars": 1560987
2191
+ },
2192
+ "moss.cc100-ar": {
2193
+ "vocab_size": 106072,
2194
+ "n_bytes": 2813283,
2195
+ "n_tokens": 1557671,
2196
+ "n_chars": 1560987
2197
+ },
2198
+ "mt5_large.cc100-ar": {
2199
+ "vocab_size": 250100,
2200
+ "n_bytes": 2813283,
2201
+ "n_tokens": 631736,
2202
+ "n_chars": 1560987
2203
+ },
2204
+ "olmo_7b.cc100-ar": {
2205
+ "vocab_size": 50280,
2206
+ "n_bytes": 2813283,
2207
+ "n_tokens": 1106277,
2208
+ "n_chars": 1560987
2209
+ },
2210
+ "orion_14b_chat.cc100-ar": {
2211
+ "vocab_size": 84608,
2212
+ "n_bytes": 2813283,
2213
+ "n_tokens": 1531053,
2214
+ "n_chars": 1560987
2215
+ },
2216
+ "phi_1.cc100-ar": {
2217
+ "vocab_size": 50295,
2218
+ "n_bytes": 2813283,
2219
+ "n_tokens": 1558111,
2220
+ "n_chars": 1560987
2221
+ },
2222
+ "phi_2.cc100-ar": {
2223
+ "vocab_size": 50295,
2224
+ "n_bytes": 2813283,
2225
+ "n_tokens": 1558111,
2226
+ "n_chars": 1560987
2227
+ },
2228
+ "phi_3_mini.cc100-ar": {
2229
+ "vocab_size": 32011,
2230
+ "n_bytes": 2813283,
2231
+ "n_tokens": 1432081,
2232
+ "n_chars": 1560987
2233
+ },
2234
+ "pko_t5_large.cc100-ar": {
2235
+ "vocab_size": 50358,
2236
+ "n_bytes": 2813283,
2237
+ "n_tokens": 2815586,
2238
+ "n_chars": 1560987
2239
+ },
2240
+ "prompt_clue.cc100-ar": {
2241
+ "vocab_size": 32128,
2242
+ "n_bytes": 2813283,
2243
+ "n_tokens": 1006313,
2244
+ "n_chars": 1560987
2245
+ },
2246
+ "qwen1_5_14b_chat.cc100-ar": {
2247
+ "vocab_size": 151646,
2248
+ "n_bytes": 2813283,
2249
+ "n_tokens": 614959,
2250
+ "n_chars": 1560987
2251
+ },
2252
+ "qwen_1_8b_chat.cc100-ar": {
2253
+ "vocab_size": 151851,
2254
+ "n_bytes": 2813283,
2255
+ "n_tokens": 614959,
2256
+ "n_chars": 1560987
2257
+ },
2258
+ "qwen_72b_chat.cc100-ar": {
2259
+ "vocab_size": 151851,
2260
+ "n_bytes": 2813283,
2261
+ "n_tokens": 614959,
2262
+ "n_chars": 1560987
2263
+ },
2264
+ "qwen_7b_chat.cc100-ar": {
2265
+ "vocab_size": 151851,
2266
+ "n_bytes": 2813283,
2267
+ "n_tokens": 614959,
2268
+ "n_chars": 1560987
2269
+ },
2270
+ "roberta_chinese_clue.cc100-ar": {
2271
+ "vocab_size": 8021,
2272
+ "n_bytes": 2813283,
2273
+ "n_tokens": 621762,
2274
+ "n_chars": 1560987
2275
+ },
2276
+ "skywork_13b_base.cc100-ar": {
2277
+ "vocab_size": 65519,
2278
+ "n_bytes": 2813283,
2279
+ "n_tokens": 1432065,
2280
+ "n_chars": 1560987
2281
+ },
2282
+ "skywork_13b_math.cc100-ar": {
2283
+ "vocab_size": 65519,
2284
+ "n_bytes": 2813283,
2285
+ "n_tokens": 1432065,
2286
+ "n_chars": 1560987
2287
+ },
2288
+ "solar_10_7b.cc100-ar": {
2289
+ "vocab_size": 32000,
2290
+ "n_bytes": 2813283,
2291
+ "n_tokens": 1406319,
2292
+ "n_chars": 1560987
2293
+ },
2294
+ "starchat_alpha.cc100-ar": {
2295
+ "vocab_size": 49156,
2296
+ "n_bytes": 2813283,
2297
+ "n_tokens": 1195640,
2298
+ "n_chars": 1560987
2299
+ },
2300
+ "switch_c_2048.cc100-ar": {
2301
+ "vocab_size": 32100,
2302
+ "n_bytes": 2813283,
2303
+ "n_tokens": 568855,
2304
+ "n_chars": 1560987
2305
+ },
2306
+ "t5_base.cc100-ar": {
2307
+ "vocab_size": 32100,
2308
+ "n_bytes": 2813283,
2309
+ "n_tokens": 568855,
2310
+ "n_chars": 1560987
2311
+ },
2312
+ "t5_large.cc100-ar": {
2313
+ "vocab_size": 32100,
2314
+ "n_bytes": 2813283,
2315
+ "n_tokens": 568855,
2316
+ "n_chars": 1560987
2317
+ },
2318
+ "t5_small.cc100-ar": {
2319
+ "vocab_size": 32100,
2320
+ "n_bytes": 2813283,
2321
+ "n_tokens": 568855,
2322
+ "n_chars": 1560987
2323
+ },
2324
+ "text_davinci_003.cc100-ar": {
2325
+ "vocab_size": 50281,
2326
+ "n_bytes": 2813283,
2327
+ "n_tokens": 1558111,
2328
+ "n_chars": 1560987
2329
+ },
2330
+ "tigerbot_13b_chat_v2.cc100-ar": {
2331
+ "vocab_size": 60515,
2332
+ "n_bytes": 2813283,
2333
+ "n_tokens": 1422070,
2334
+ "n_chars": 1560987
2335
+ },
2336
+ "tigerbot_70b_chat_v4_4k.cc100-ar": {
2337
+ "vocab_size": 65110,
2338
+ "n_bytes": 2813283,
2339
+ "n_tokens": 1422073,
2340
+ "n_chars": 1560987
2341
+ },
2342
+ "wizardcoder_15b_v1.cc100-ar": {
2343
+ "vocab_size": 49153,
2344
+ "n_bytes": 2813283,
2345
+ "n_tokens": 1195640,
2346
+ "n_chars": 1560987
2347
+ },
2348
+ "wizardcoder_python_7b_v1.cc100-ar": {
2349
+ "vocab_size": 32001,
2350
+ "n_bytes": 2813283,
2351
+ "n_tokens": 1432081,
2352
+ "n_chars": 1560987
2353
+ },
2354
+ "wizardlm_7b_v1.cc100-ar": {
2355
+ "vocab_size": 32001,
2356
+ "n_bytes": 2813283,
2357
+ "n_tokens": 1432081,
2358
+ "n_chars": 1560987
2359
+ },
2360
+ "wizardmath_70b_v1.cc100-ar": {
2361
+ "vocab_size": 32002,
2362
+ "n_bytes": 2813283,
2363
+ "n_tokens": 1432081,
2364
+ "n_chars": 1560987
2365
+ },
2366
+ "xlm_roberta.cc100-ar": {
2367
+ "vocab_size": 250002,
2368
+ "n_bytes": 2813283,
2369
+ "n_tokens": 518287,
2370
+ "n_chars": 1560987
2371
+ },
2372
+ "yi_34b.cc100-ar": {
2373
+ "vocab_size": 64000,
2374
+ "n_bytes": 2813283,
2375
+ "n_tokens": 1795801,
2376
+ "n_chars": 1560987
2377
+ },
2378
+ "yi_6b.cc100-ar": {
2379
+ "vocab_size": 64000,
2380
+ "n_bytes": 2813283,
2381
+ "n_tokens": 1795801,
2382
+ "n_chars": 1560987
2383
+ },
2384
+ "yi_vl34b.cc100-ar": {
2385
+ "vocab_size": 64000,
2386
+ "n_bytes": 2813283,
2387
+ "n_tokens": 1803957,
2388
+ "n_chars": 1560987
2389
+ },
2390
+ "zephyr_7b_beta.cc100-ar": {
2391
+ "vocab_size": 32000,
2392
+ "n_bytes": 2813283,
2393
+ "n_tokens": 1406319,
2394
+ "n_chars": 1560987
2395
+ },
2396
+ "aya_101.cc100-de": {
2397
+ "vocab_size": 250100,
2398
+ "n_bytes": 1814876,
2399
+ "n_tokens": 480418,
2400
+ "n_chars": 1784021
2401
+ },
2402
+ "baichuan.cc100-de": {
2403
+ "vocab_size": 64000,
2404
+ "n_bytes": 1814876,
2405
+ "n_tokens": 680512,
2406
+ "n_chars": 1784021
2407
+ },
2408
+ "baichuan2.cc100-de": {
2409
+ "vocab_size": 125696,
2410
+ "n_bytes": 1814876,
2411
+ "n_tokens": 628063,
2412
+ "n_chars": 1784021
2413
+ },
2414
+ "bert_base_cased.cc100-de": {
2415
+ "vocab_size": 28996,
2416
+ "n_bytes": 1814876,
2417
+ "n_tokens": 731093,
2418
+ "n_chars": 1784021
2419
+ },
2420
+ "bert_base_chinese.cc100-de": {
2421
+ "vocab_size": 21128,
2422
+ "n_bytes": 1814876,
2423
+ "n_tokens": 561246,
2424
+ "n_chars": 1784021
2425
+ },
2426
+ "bert_base_uncased.cc100-de": {
2427
+ "vocab_size": 30522,
2428
+ "n_bytes": 1814876,
2429
+ "n_tokens": 646485,
2430
+ "n_chars": 1784021
2431
+ },
2432
+ "bloom.cc100-de": {
2433
+ "vocab_size": 250680,
2434
+ "n_bytes": 1814876,
2435
+ "n_tokens": 541170,
2436
+ "n_chars": 1784021
2437
+ },
2438
+ "byt5_small.cc100-de": {
2439
+ "vocab_size": 384,
2440
+ "n_bytes": 1814876,
2441
+ "n_tokens": 1824876,
2442
+ "n_chars": 1784021
2443
+ },
2444
+ "character_glm_6b.cc100-de": {
2445
+ "vocab_size": 64789,
2446
+ "n_bytes": 1814876,
2447
+ "n_tokens": 639822,
2448
+ "n_chars": 1784021
2449
+ },
2450
+ "chatglm2_6b.cc100-de": {
2451
+ "vocab_size": 64787,
2452
+ "n_bytes": 1814876,
2453
+ "n_tokens": 639757,
2454
+ "n_chars": 1784021
2455
+ },
2456
+ "chatglm3_6b.cc100-de": {
2457
+ "vocab_size": 64796,
2458
+ "n_bytes": 1814876,
2459
+ "n_tokens": 639822,
2460
+ "n_chars": 1784021
2461
+ },
2462
+ "chatglm_6b.cc100-de": {
2463
+ "vocab_size": 150344,
2464
+ "n_bytes": 1814876,
2465
+ "n_tokens": 589464,
2466
+ "n_chars": 1784021
2467
+ },
2468
+ "chatyuan_large_v2.cc100-de": {
2469
+ "vocab_size": 32128,
2470
+ "n_bytes": 1814876,
2471
+ "n_tokens": 970463,
2472
+ "n_chars": 1784021
2473
+ },
2474
+ "chinese_llama.cc100-de": {
2475
+ "vocab_size": 49953,
2476
+ "n_bytes": 1814876,
2477
+ "n_tokens": 523859,
2478
+ "n_chars": 1784021
2479
+ },
2480
+ "chinese_llama2.cc100-de": {
2481
+ "vocab_size": 55296,
2482
+ "n_bytes": 1814876,
2483
+ "n_tokens": 537318,
2484
+ "n_chars": 1784021
2485
+ },
2486
+ "code_davinci_002.cc100-de": {
2487
+ "vocab_size": 50281,
2488
+ "n_bytes": 1814876,
2489
+ "n_tokens": 684666,
2490
+ "n_chars": 1784021
2491
+ },
2492
+ "crystal_coder.cc100-de": {
2493
+ "vocab_size": 32022,
2494
+ "n_bytes": 1814876,
2495
+ "n_tokens": 527320,
2496
+ "n_chars": 1784021
2497
+ },
2498
+ "dbrx_instruct.cc100-de": {
2499
+ "vocab_size": 100280,
2500
+ "n_bytes": 1814876,
2501
+ "n_tokens": 500870,
2502
+ "n_chars": 1784021
2503
+ },
2504
+ "deepseek_coder_33b_instruct.cc100-de": {
2505
+ "vocab_size": 32022,
2506
+ "n_bytes": 1814876,
2507
+ "n_tokens": 745618,
2508
+ "n_chars": 1784021
2509
+ },
2510
+ "deepseek_llm_7b_base.cc100-de": {
2511
+ "vocab_size": 100015,
2512
+ "n_bytes": 1814876,
2513
+ "n_tokens": 642573,
2514
+ "n_chars": 1784021
2515
+ },
2516
+ "falcon_180b.cc100-de": {
2517
+ "vocab_size": 65024,
2518
+ "n_bytes": 1814876,
2519
+ "n_tokens": 497054,
2520
+ "n_chars": 1784021
2521
+ },
2522
+ "falcon_7b.cc100-de": {
2523
+ "vocab_size": 65024,
2524
+ "n_bytes": 1814876,
2525
+ "n_tokens": 497054,
2526
+ "n_chars": 1784021
2527
+ },
2528
+ "fastchat_t5_3b.cc100-de": {
2529
+ "vocab_size": 32110,
2530
+ "n_bytes": 1814876,
2531
+ "n_tokens": 736989,
2532
+ "n_chars": 1784021
2533
+ },
2534
+ "flan_t5_base.cc100-de": {
2535
+ "vocab_size": 32100,
2536
+ "n_bytes": 1814876,
2537
+ "n_tokens": 480254,
2538
+ "n_chars": 1784021
2539
+ },
2540
+ "gemma_7b.cc100-de": {
2541
+ "vocab_size": 256000,
2542
+ "n_bytes": 1814876,
2543
+ "n_tokens": 416876,
2544
+ "n_chars": 1784021
2545
+ },
2546
+ "gpt2.cc100-de": {
2547
+ "vocab_size": 50257,
2548
+ "n_bytes": 1814876,
2549
+ "n_tokens": 684669,
2550
+ "n_chars": 1784021
2551
+ },
2552
+ "gpt2_chinese.cc100-de": {
2553
+ "vocab_size": 21128,
2554
+ "n_bytes": 1814876,
2555
+ "n_tokens": 786497,
2556
+ "n_chars": 1784021
2557
+ },
2558
+ "gpt_35_turbo.cc100-de": {
2559
+ "vocab_size": 100277,
2560
+ "n_bytes": 1814876,
2561
+ "n_tokens": 500870,
2562
+ "n_chars": 1784021
2563
+ },
2564
+ "gpt_4.cc100-de": {
2565
+ "vocab_size": 100277,
2566
+ "n_bytes": 1814876,
2567
+ "n_tokens": 500870,
2568
+ "n_chars": 1784021
2569
+ },
2570
+ "gpt_neox_japanese_2_7b.cc100-de": {
2571
+ "vocab_size": 32000,
2572
+ "n_bytes": 1814876,
2573
+ "n_tokens": 1807780,
2574
+ "n_chars": 1784021
2575
+ },
2576
+ "gpt_nexo_20b.cc100-de": {
2577
+ "vocab_size": 50277,
2578
+ "n_bytes": 1814876,
2579
+ "n_tokens": 583628,
2580
+ "n_chars": 1784021
2581
+ },
2582
+ "grok_1.cc100-de": {
2583
+ "vocab_size": 131072,
2584
+ "n_bytes": 1814876,
2585
+ "n_tokens": 505220,
2586
+ "n_chars": 1784021
2587
+ },
2588
+ "internlm2_chat_7b.cc100-de": {
2589
+ "vocab_size": 92544,
2590
+ "n_bytes": 1814876,
2591
+ "n_tokens": 583917,
2592
+ "n_chars": 1784021
2593
+ },
2594
+ "internlm2_math_7b.cc100-de": {
2595
+ "vocab_size": 92544,
2596
+ "n_bytes": 1814876,
2597
+ "n_tokens": 583917,
2598
+ "n_chars": 1784021
2599
+ },
2600
+ "internlm_chat_7b.cc100-de": {
2601
+ "vocab_size": 103168,
2602
+ "n_bytes": 1814876,
2603
+ "n_tokens": 580489,
2604
+ "n_chars": 1784021
2605
+ },
2606
+ "internlm_xcomposer_7b.cc100-de": {
2607
+ "vocab_size": 103168,
2608
+ "n_bytes": 1814876,
2609
+ "n_tokens": 580489,
2610
+ "n_chars": 1784021
2611
+ },
2612
+ "jamba_v0_1.cc100-de": {
2613
+ "vocab_size": 65536,
2614
+ "n_bytes": 1814876,
2615
+ "n_tokens": 535856,
2616
+ "n_chars": 1784021
2617
+ },
2618
+ "kplug.cc100-de": {
2619
+ "vocab_size": 10261,
2620
+ "n_bytes": 1814876,
2621
+ "n_tokens": 789053,
2622
+ "n_chars": 1784021
2623
+ },
2624
+ "llama.cc100-de": {
2625
+ "vocab_size": 32000,
2626
+ "n_bytes": 1814876,
2627
+ "n_tokens": 537320,
2628
+ "n_chars": 1784021
2629
+ },
2630
+ "llama2.cc100-de": {
2631
+ "vocab_size": 32001,
2632
+ "n_bytes": 1814876,
2633
+ "n_tokens": 537320,
2634
+ "n_chars": 1784021
2635
+ },
2636
+ "llama3.cc100-de": {
2637
+ "vocab_size": 128256,
2638
+ "n_bytes": 1814876,
2639
+ "n_tokens": 499766,
2640
+ "n_chars": 1784021
2641
+ },
2642
+ "mistral_7b.cc100-de": {
2643
+ "vocab_size": 32000,
2644
+ "n_bytes": 1814876,
2645
+ "n_tokens": 577526,
2646
+ "n_chars": 1784021
2647
+ },
2648
+ "mixtral_8_7b.cc100-de": {
2649
+ "vocab_size": 32000,
2650
+ "n_bytes": 1814876,
2651
+ "n_tokens": 577526,
2652
+ "n_chars": 1784021
2653
+ },
2654
+ "mobilebert_uncased.cc100-de": {
2655
+ "vocab_size": 30522,
2656
+ "n_bytes": 1814876,
2657
+ "n_tokens": 646485,
2658
+ "n_chars": 1784021
2659
+ },
2660
+ "moss.cc100-de": {
2661
+ "vocab_size": 106072,
2662
+ "n_bytes": 1814876,
2663
+ "n_tokens": 683401,
2664
+ "n_chars": 1784021
2665
+ },
2666
+ "mt5_large.cc100-de": {
2667
+ "vocab_size": 250100,
2668
+ "n_bytes": 1814876,
2669
+ "n_tokens": 480418,
2670
+ "n_chars": 1784021
2671
+ },
2672
+ "olmo_7b.cc100-de": {
2673
+ "vocab_size": 50280,
2674
+ "n_bytes": 1814876,
2675
+ "n_tokens": 583628,
2676
+ "n_chars": 1784021
2677
+ },
2678
+ "orion_14b_chat.cc100-de": {
2679
+ "vocab_size": 84608,
2680
+ "n_bytes": 1814876,
2681
+ "n_tokens": 744404,
2682
+ "n_chars": 1784021
2683
+ },
2684
+ "phi_1.cc100-de": {
2685
+ "vocab_size": 50295,
2686
+ "n_bytes": 1814876,
2687
+ "n_tokens": 684665,
2688
+ "n_chars": 1784021
2689
+ },
2690
+ "phi_2.cc100-de": {
2691
+ "vocab_size": 50295,
2692
+ "n_bytes": 1814876,
2693
+ "n_tokens": 684665,
2694
+ "n_chars": 1784021
2695
+ },
2696
+ "phi_3_mini.cc100-de": {
2697
+ "vocab_size": 32011,
2698
+ "n_bytes": 1814876,
2699
+ "n_tokens": 537320,
2700
+ "n_chars": 1784021
2701
+ },
2702
+ "pko_t5_large.cc100-de": {
2703
+ "vocab_size": 50358,
2704
+ "n_bytes": 1814876,
2705
+ "n_tokens": 1254350,
2706
+ "n_chars": 1784021
2707
+ },
2708
+ "prompt_clue.cc100-de": {
2709
+ "vocab_size": 32128,
2710
+ "n_bytes": 1814876,
2711
+ "n_tokens": 970463,
2712
+ "n_chars": 1784021
2713
+ },
2714
+ "qwen1_5_14b_chat.cc100-de": {
2715
+ "vocab_size": 151646,
2716
+ "n_bytes": 1814876,
2717
+ "n_tokens": 503561,
2718
+ "n_chars": 1784021
2719
+ },
2720
+ "qwen_1_8b_chat.cc100-de": {
2721
+ "vocab_size": 151851,
2722
+ "n_bytes": 1814876,
2723
+ "n_tokens": 503561,
2724
+ "n_chars": 1784021
2725
+ },
2726
+ "qwen_72b_chat.cc100-de": {
2727
+ "vocab_size": 151851,
2728
+ "n_bytes": 1814876,
2729
+ "n_tokens": 503561,
2730
+ "n_chars": 1784021
2731
+ },
2732
+ "qwen_7b_chat.cc100-de": {
2733
+ "vocab_size": 151851,
2734
+ "n_bytes": 1814876,
2735
+ "n_tokens": 503561,
2736
+ "n_chars": 1784021
2737
+ },
2738
+ "roberta_chinese_clue.cc100-de": {
2739
+ "vocab_size": 8021,
2740
+ "n_bytes": 1814876,
2741
+ "n_tokens": 915612,
2742
+ "n_chars": 1784021
2743
+ },
2744
+ "skywork_13b_base.cc100-de": {
2745
+ "vocab_size": 65519,
2746
+ "n_bytes": 1814876,
2747
+ "n_tokens": 537308,
2748
+ "n_chars": 1784021
2749
+ },
2750
+ "skywork_13b_math.cc100-de": {
2751
+ "vocab_size": 65519,
2752
+ "n_bytes": 1814876,
2753
+ "n_tokens": 537308,
2754
+ "n_chars": 1784021
2755
+ },
2756
+ "solar_10_7b.cc100-de": {
2757
+ "vocab_size": 32000,
2758
+ "n_bytes": 1814876,
2759
+ "n_tokens": 577526,
2760
+ "n_chars": 1784021
2761
+ },
2762
+ "starchat_alpha.cc100-de": {
2763
+ "vocab_size": 49156,
2764
+ "n_bytes": 1814876,
2765
+ "n_tokens": 620541,
2766
+ "n_chars": 1784021
2767
+ },
2768
+ "switch_c_2048.cc100-de": {
2769
+ "vocab_size": 32100,
2770
+ "n_bytes": 1814876,
2771
+ "n_tokens": 480254,
2772
+ "n_chars": 1784021
2773
+ },
2774
+ "t5_base.cc100-de": {
2775
+ "vocab_size": 32100,
2776
+ "n_bytes": 1814876,
2777
+ "n_tokens": 480254,
2778
+ "n_chars": 1784021
2779
+ },
2780
+ "t5_large.cc100-de": {
2781
+ "vocab_size": 32100,
2782
+ "n_bytes": 1814876,
2783
+ "n_tokens": 480254,
2784
+ "n_chars": 1784021
2785
+ },
2786
+ "t5_small.cc100-de": {
2787
+ "vocab_size": 32100,
2788
+ "n_bytes": 1814876,
2789
+ "n_tokens": 480254,
2790
+ "n_chars": 1784021
2791
+ },
2792
+ "text_davinci_003.cc100-de": {
2793
+ "vocab_size": 50281,
2794
+ "n_bytes": 1814876,
2795
+ "n_tokens": 684666,
2796
+ "n_chars": 1784021
2797
+ },
2798
+ "tigerbot_13b_chat_v2.cc100-de": {
2799
+ "vocab_size": 60515,
2800
+ "n_bytes": 1814876,
2801
+ "n_tokens": 528918,
2802
+ "n_chars": 1784021
2803
+ },
2804
+ "tigerbot_70b_chat_v4_4k.cc100-de": {
2805
+ "vocab_size": 65110,
2806
+ "n_bytes": 1814876,
2807
+ "n_tokens": 529170,
2808
+ "n_chars": 1784021
2809
+ },
2810
+ "wizardcoder_15b_v1.cc100-de": {
2811
+ "vocab_size": 49153,
2812
+ "n_bytes": 1814876,
2813
+ "n_tokens": 620541,
2814
+ "n_chars": 1784021
2815
+ },
2816
+ "wizardcoder_python_7b_v1.cc100-de": {
2817
+ "vocab_size": 32001,
2818
+ "n_bytes": 1814876,
2819
+ "n_tokens": 537320,
2820
+ "n_chars": 1784021
2821
+ },
2822
+ "wizardlm_7b_v1.cc100-de": {
2823
+ "vocab_size": 32001,
2824
+ "n_bytes": 1814876,
2825
+ "n_tokens": 537320,
2826
+ "n_chars": 1784021
2827
+ },
2828
+ "wizardmath_70b_v1.cc100-de": {
2829
+ "vocab_size": 32002,
2830
+ "n_bytes": 1814876,
2831
+ "n_tokens": 537320,
2832
+ "n_chars": 1784021
2833
+ },
2834
+ "xlm_roberta.cc100-de": {
2835
+ "vocab_size": 250002,
2836
+ "n_bytes": 1814876,
2837
+ "n_tokens": 432571,
2838
+ "n_chars": 1784021
2839
+ },
2840
+ "yi_34b.cc100-de": {
2841
+ "vocab_size": 64000,
2842
+ "n_bytes": 1814876,
2843
+ "n_tokens": 698366,
2844
+ "n_chars": 1784021
2845
+ },
2846
+ "yi_6b.cc100-de": {
2847
+ "vocab_size": 64000,
2848
+ "n_bytes": 1814876,
2849
+ "n_tokens": 698366,
2850
+ "n_chars": 1784021
2851
+ },
2852
+ "yi_vl34b.cc100-de": {
2853
+ "vocab_size": 64000,
2854
+ "n_bytes": 1814876,
2855
+ "n_tokens": 697065,
2856
+ "n_chars": 1784021
2857
+ },
2858
+ "zephyr_7b_beta.cc100-de": {
2859
+ "vocab_size": 32000,
2860
+ "n_bytes": 1814876,
2861
+ "n_tokens": 577526,
2862
+ "n_chars": 1784021
2863
+ },
2864
+ "gpt_neox_japanese_2_7b.cc100-es": {
2865
+ "vocab_size": 32000,
2866
+ "n_bytes": 1664455,
2867
+ "n_tokens": 1658946,
2868
+ "n_chars": 1630297
2869
+ },
2870
+ "gpt_neox_japanese_2_7b.cc100-fr": {
2871
+ "vocab_size": 32000,
2872
+ "n_bytes": 1540504,
2873
+ "n_tokens": 1524129,
2874
+ "n_chars": 1484970
2875
+ },
2876
+ "character_glm_6b.cc100-ja": {
2877
+ "vocab_size": 64789,
2878
+ "n_bytes": 1774770,
2879
+ "n_tokens": 601380,
2880
+ "n_chars": 603065
2881
+ },
2882
+ "chatglm2_6b.cc100-ja": {
2883
+ "vocab_size": 64787,
2884
+ "n_bytes": 1774770,
2885
+ "n_tokens": 601380,
2886
+ "n_chars": 603065
2887
+ },
2888
+ "chatglm3_6b.cc100-ja": {
2889
+ "vocab_size": 64796,
2890
+ "n_bytes": 1774770,
2891
+ "n_tokens": 601380,
2892
+ "n_chars": 603065
2893
+ },
2894
+ "chatglm_6b.cc100-ja": {
2895
+ "vocab_size": 150344,
2896
+ "n_bytes": 1774770,
2897
+ "n_tokens": 489930,
2898
+ "n_chars": 603065
2899
+ },
2900
+ "chatyuan_large_v2.cc100-ja": {
2901
+ "vocab_size": 32128,
2902
+ "n_bytes": 1774770,
2903
+ "n_tokens": 575118,
2904
+ "n_chars": 603065
2905
+ },
2906
+ "chinese_llama.cc100-ja": {
2907
+ "vocab_size": 49953,
2908
+ "n_bytes": 1774770,
2909
+ "n_tokens": 614177,
2910
+ "n_chars": 603065
2911
+ },
2912
+ "chinese_llama2.cc100-ja": {
2913
+ "vocab_size": 55296,
2914
+ "n_bytes": 1774770,
2915
+ "n_tokens": 624362,
2916
+ "n_chars": 603065
2917
+ },
2918
+ "code_davinci_002.cc100-ja": {
2919
+ "vocab_size": 50281,
2920
+ "n_bytes": 1774770,
2921
+ "n_tokens": 844362,
2922
+ "n_chars": 603065
2923
+ },
2924
+ "crystal_coder.cc100-ja": {
2925
+ "vocab_size": 32022,
2926
+ "n_bytes": 1774770,
2927
+ "n_tokens": 718461,
2928
+ "n_chars": 603065
2929
+ },
2930
+ "dbrx_instruct.cc100-ja": {
2931
+ "vocab_size": 100280,
2932
+ "n_bytes": 1774770,
2933
+ "n_tokens": 630348,
2934
+ "n_chars": 603065
2935
+ },
2936
+ "deepseek_coder_33b_instruct.cc100-ja": {
2937
+ "vocab_size": 32022,
2938
+ "n_bytes": 1774770,
2939
+ "n_tokens": 1018060,
2940
+ "n_chars": 603065
2941
+ },
2942
+ "deepseek_llm_7b_base.cc100-ja": {
2943
+ "vocab_size": 100015,
2944
+ "n_bytes": 1774770,
2945
+ "n_tokens": 761467,
2946
+ "n_chars": 603065
2947
+ },
2948
+ "falcon_180b.cc100-ja": {
2949
+ "vocab_size": 65024,
2950
+ "n_bytes": 1774770,
2951
+ "n_tokens": 842458,
2952
+ "n_chars": 603065
2953
+ },
2954
+ "falcon_7b.cc100-ja": {
2955
+ "vocab_size": 65024,
2956
+ "n_bytes": 1774770,
2957
+ "n_tokens": 842458,
2958
+ "n_chars": 603065
2959
+ },
2960
+ "fastchat_t5_3b.cc100-ja": {
2961
+ "vocab_size": 32110,
2962
+ "n_bytes": 1774770,
2963
+ "n_tokens": 53915,
2964
+ "n_chars": 603065
2965
+ },
2966
+ "flan_t5_base.cc100-ja": {
2967
+ "vocab_size": 32100,
2968
+ "n_bytes": 1774770,
2969
+ "n_tokens": 51999,
2970
+ "n_chars": 603065
2971
+ },
2972
+ "gemma_7b.cc100-ja": {
2973
+ "vocab_size": 256000,
2974
+ "n_bytes": 1774770,
2975
+ "n_tokens": 317873,
2976
+ "n_chars": 603065
2977
+ },
2978
+ "gpt2.cc100-ja": {
2979
+ "vocab_size": 50257,
2980
+ "n_bytes": 1774770,
2981
+ "n_tokens": 844362,
2982
+ "n_chars": 603065
2983
+ },
2984
+ "gpt2_chinese.cc100-ja": {
2985
+ "vocab_size": 21128,
2986
+ "n_bytes": 1774770,
2987
+ "n_tokens": 503085,
2988
+ "n_chars": 603065
2989
+ },
2990
+ "gpt_35_turbo.cc100-ja": {
2991
+ "vocab_size": 100277,
2992
+ "n_bytes": 1774770,
2993
+ "n_tokens": 630348,
2994
+ "n_chars": 603065
2995
+ },
2996
+ "gpt_4.cc100-ja": {
2997
+ "vocab_size": 100277,
2998
+ "n_bytes": 1774770,
2999
+ "n_tokens": 630348,
3000
+ "n_chars": 603065
3001
+ },
3002
+ "gpt_neox_japanese_2_7b.cc100-ja": {
3003
+ "vocab_size": 32000,
3004
+ "n_bytes": 1774770,
3005
+ "n_tokens": 410803,
3006
+ "n_chars": 603065
3007
+ },
3008
+ "gpt_nexo_20b.cc100-ja": {
3009
+ "vocab_size": 50277,
3010
+ "n_bytes": 1774770,
3011
+ "n_tokens": 605168,
3012
+ "n_chars": 603065
3013
+ },
3014
+ "grok_1.cc100-ja": {
3015
+ "vocab_size": 131072,
3016
+ "n_bytes": 1774770,
3017
+ "n_tokens": 497590,
3018
+ "n_chars": 603065
3019
+ },
3020
+ "internlm2_chat_7b.cc100-ja": {
3021
+ "vocab_size": 92544,
3022
+ "n_bytes": 1774770,
3023
+ "n_tokens": 595803,
3024
+ "n_chars": 603065
3025
+ },
3026
+ "internlm2_math_7b.cc100-ja": {
3027
+ "vocab_size": 92544,
3028
+ "n_bytes": 1774770,
3029
+ "n_tokens": 595803,
3030
+ "n_chars": 603065
3031
+ },
3032
+ "internlm_chat_7b.cc100-ja": {
3033
+ "vocab_size": 103168,
3034
+ "n_bytes": 1774770,
3035
+ "n_tokens": 448212,
3036
+ "n_chars": 603065
3037
+ },
3038
+ "internlm_xcomposer_7b.cc100-ja": {
3039
+ "vocab_size": 103168,
3040
+ "n_bytes": 1774770,
3041
+ "n_tokens": 448212,
3042
+ "n_chars": 603065
3043
+ },
3044
+ "jamba_v0_1.cc100-ja": {
3045
+ "vocab_size": 65536,
3046
+ "n_bytes": 1774770,
3047
+ "n_tokens": 683256,
3048
+ "n_chars": 603065
3049
+ },
3050
+ "kplug.cc100-ja": {
3051
+ "vocab_size": 10261,
3052
+ "n_bytes": 1774770,
3053
+ "n_tokens": 338023,
3054
+ "n_chars": 603065
3055
+ },
3056
+ "llama.cc100-ja": {
3057
+ "vocab_size": 32000,
3058
+ "n_bytes": 1774770,
3059
+ "n_tokens": 728461,
3060
+ "n_chars": 603065
3061
+ },
3062
+ "llama2.cc100-ja": {
3063
+ "vocab_size": 32001,
3064
+ "n_bytes": 1774770,
3065
+ "n_tokens": 728461,
3066
+ "n_chars": 603065
3067
+ },
3068
+ "llama3.cc100-ja": {
3069
+ "vocab_size": 128256,
3070
+ "n_bytes": 1774770,
3071
+ "n_tokens": 414715,
3072
+ "n_chars": 603065
3073
+ },
3074
+ "mistral_7b.cc100-ja": {
3075
+ "vocab_size": 32000,
3076
+ "n_bytes": 1774770,
3077
+ "n_tokens": 685134,
3078
+ "n_chars": 603065
3079
+ },
3080
+ "mixtral_8_7b.cc100-ja": {
3081
+ "vocab_size": 32000,
3082
+ "n_bytes": 1774770,
3083
+ "n_tokens": 685134,
3084
+ "n_chars": 603065
3085
+ },
3086
+ "mobilebert_uncased.cc100-ja": {
3087
+ "vocab_size": 30522,
3088
+ "n_bytes": 1774770,
3089
+ "n_tokens": 580634,
3090
+ "n_chars": 603065
3091
+ },
3092
+ "moss.cc100-ja": {
3093
+ "vocab_size": 106072,
3094
+ "n_bytes": 1774770,
3095
+ "n_tokens": 600011,
3096
+ "n_chars": 603065
3097
+ },
3098
+ "mt5_large.cc100-ja": {
3099
+ "vocab_size": 250100,
3100
+ "n_bytes": 1774770,
3101
+ "n_tokens": 300542,
3102
+ "n_chars": 603065
3103
+ },
3104
+ "olmo_7b.cc100-ja": {
3105
+ "vocab_size": 50280,
3106
+ "n_bytes": 1774770,
3107
+ "n_tokens": 605168,
3108
+ "n_chars": 603065
3109
+ },
3110
+ "orion_14b_chat.cc100-ja": {
3111
+ "vocab_size": 84608,
3112
+ "n_bytes": 1774770,
3113
+ "n_tokens": 324956,
3114
+ "n_chars": 603065
3115
+ },
3116
+ "phi_1.cc100-ja": {
3117
+ "vocab_size": 50295,
3118
+ "n_bytes": 1774770,
3119
+ "n_tokens": 844362,
3120
+ "n_chars": 603065
3121
+ },
3122
+ "phi_2.cc100-ja": {
3123
+ "vocab_size": 50295,
3124
+ "n_bytes": 1774770,
3125
+ "n_tokens": 844362,
3126
+ "n_chars": 603065
3127
+ },
3128
+ "phi_3_mini.cc100-ja": {
3129
+ "vocab_size": 32011,
3130
+ "n_bytes": 1774770,
3131
+ "n_tokens": 728461,
3132
+ "n_chars": 603065
3133
+ },
3134
+ "pko_t5_large.cc100-ja": {
3135
+ "vocab_size": 50358,
3136
+ "n_bytes": 1774770,
3137
+ "n_tokens": 1766950,
3138
+ "n_chars": 603065
3139
+ },
3140
+ "prompt_clue.cc100-ja": {
3141
+ "vocab_size": 32128,
3142
+ "n_bytes": 1774770,
3143
+ "n_tokens": 575118,
3144
+ "n_chars": 603065
3145
+ },
3146
+ "qwen1_5_14b_chat.cc100-ja": {
3147
+ "vocab_size": 151646,
3148
+ "n_bytes": 1774770,
3149
+ "n_tokens": 377144,
3150
+ "n_chars": 603065
3151
+ },
3152
+ "qwen_1_8b_chat.cc100-ja": {
3153
+ "vocab_size": 151851,
3154
+ "n_bytes": 1774770,
3155
+ "n_tokens": 377144,
3156
+ "n_chars": 603065
3157
+ },
3158
+ "qwen_72b_chat.cc100-ja": {
3159
+ "vocab_size": 151851,
3160
+ "n_bytes": 1774770,
3161
+ "n_tokens": 377144,
3162
+ "n_chars": 603065
3163
+ },
3164
+ "qwen_7b_chat.cc100-ja": {
3165
+ "vocab_size": 151851,
3166
+ "n_bytes": 1774770,
3167
+ "n_tokens": 377144,
3168
+ "n_chars": 603065
3169
+ },
3170
+ "roberta_chinese_clue.cc100-ja": {
3171
+ "vocab_size": 8021,
3172
+ "n_bytes": 1774770,
3173
+ "n_tokens": 339411,
3174
+ "n_chars": 603065
3175
+ },
3176
+ "skywork_13b_base.cc100-ja": {
3177
+ "vocab_size": 65519,
3178
+ "n_bytes": 1774770,
3179
+ "n_tokens": 603613,
3180
+ "n_chars": 603065
3181
+ },
3182
+ "skywork_13b_math.cc100-ja": {
3183
+ "vocab_size": 65519,
3184
+ "n_bytes": 1774770,
3185
+ "n_tokens": 603613,
3186
+ "n_chars": 603065
3187
+ },
3188
+ "solar_10_7b.cc100-ja": {
3189
+ "vocab_size": 32000,
3190
+ "n_bytes": 1774770,
3191
+ "n_tokens": 685134,
3192
+ "n_chars": 603065
3193
+ },
3194
+ "starchat_alpha.cc100-ja": {
3195
+ "vocab_size": 49156,
3196
+ "n_bytes": 1774770,
3197
+ "n_tokens": 546876,
3198
+ "n_chars": 603065
3199
+ },
3200
+ "switch_c_2048.cc100-ja": {
3201
+ "vocab_size": 32100,
3202
+ "n_bytes": 1774770,
3203
+ "n_tokens": 51947,
3204
+ "n_chars": 603065
3205
+ },
3206
+ "t5_base.cc100-ja": {
3207
+ "vocab_size": 32100,
3208
+ "n_bytes": 1774770,
3209
+ "n_tokens": 51947,
3210
+ "n_chars": 603065
3211
+ },
3212
+ "t5_large.cc100-ja": {
3213
+ "vocab_size": 32100,
3214
+ "n_bytes": 1774770,
3215
+ "n_tokens": 51947,
3216
+ "n_chars": 603065
3217
+ },
3218
+ "t5_small.cc100-ja": {
3219
+ "vocab_size": 32100,
3220
+ "n_bytes": 1774770,
3221
+ "n_tokens": 51947,
3222
+ "n_chars": 603065
3223
+ },
3224
+ "text_davinci_003.cc100-ja": {
3225
+ "vocab_size": 50281,
3226
+ "n_bytes": 1774770,
3227
+ "n_tokens": 844362,
3228
+ "n_chars": 603065
3229
+ },
3230
+ "tigerbot_13b_chat_v2.cc100-ja": {
3231
+ "vocab_size": 60515,
3232
+ "n_bytes": 1774770,
3233
+ "n_tokens": 567792,
3234
+ "n_chars": 603065
3235
+ },
3236
+ "tigerbot_70b_chat_v4_4k.cc100-ja": {
3237
+ "vocab_size": 65110,
3238
+ "n_bytes": 1774770,
3239
+ "n_tokens": 406571,
3240
+ "n_chars": 603065
3241
+ },
3242
+ "wizardcoder_15b_v1.cc100-ja": {
3243
+ "vocab_size": 49153,
3244
+ "n_bytes": 1774770,
3245
+ "n_tokens": 546876,
3246
+ "n_chars": 603065
3247
+ },
3248
+ "wizardcoder_python_7b_v1.cc100-ja": {
3249
+ "vocab_size": 32001,
3250
+ "n_bytes": 1774770,
3251
+ "n_tokens": 728461,
3252
+ "n_chars": 603065
3253
+ },
3254
+ "wizardlm_7b_v1.cc100-ja": {
3255
+ "vocab_size": 32001,
3256
+ "n_bytes": 1774770,
3257
+ "n_tokens": 728461,
3258
+ "n_chars": 603065
3259
+ },
3260
+ "wizardmath_70b_v1.cc100-ja": {
3261
+ "vocab_size": 32002,
3262
+ "n_bytes": 1774770,
3263
+ "n_tokens": 728461,
3264
+ "n_chars": 603065
3265
+ },
3266
+ "xlm_roberta.cc100-ja": {
3267
+ "vocab_size": 250002,
3268
+ "n_bytes": 1774770,
3269
+ "n_tokens": 344820,
3270
+ "n_chars": 603065
3271
+ },
3272
+ "yi_34b.cc100-ja": {
3273
+ "vocab_size": 64000,
3274
+ "n_bytes": 1774770,
3275
+ "n_tokens": 740791,
3276
+ "n_chars": 603065
3277
+ },
3278
+ "yi_6b.cc100-ja": {
3279
+ "vocab_size": 64000,
3280
+ "n_bytes": 1774770,
3281
+ "n_tokens": 740791,
3282
+ "n_chars": 603065
3283
+ },
3284
+ "yi_vl34b.cc100-ja": {
3285
+ "vocab_size": 64000,
3286
+ "n_bytes": 1774770,
3287
+ "n_tokens": 749927,
3288
+ "n_chars": 603065
3289
+ },
3290
+ "zephyr_7b_beta.cc100-ja": {
3291
+ "vocab_size": 32000,
3292
+ "n_bytes": 1774770,
3293
+ "n_tokens": 685134,
3294
+ "n_chars": 603065
3295
+ },
3296
+ "llama_3_chinese_8b.cc100-ar": {
3297
+ "vocab_size": 128256,
3298
+ "n_bytes": 2813283,
3299
+ "n_tokens": 625514,
3300
+ "n_chars": 1560987
3301
+ },
3302
+ "llama_3_chinese_8b.cc100-de": {
3303
+ "vocab_size": 128256,
3304
+ "n_bytes": 1814876,
3305
+ "n_tokens": 509766,
3306
+ "n_chars": 1784021
3307
+ },
3308
+ "llama_3_chinese_8b.cc100-en": {
3309
+ "vocab_size": 128256,
3310
+ "n_bytes": 1124813,
3311
+ "n_tokens": 264944,
3312
+ "n_chars": 1121360
3313
+ },
3314
+ "llama_3_chinese_8b.cc100-es": {
3315
+ "vocab_size": 128256,
3316
+ "n_bytes": 1664455,
3317
+ "n_tokens": 443289,
3318
+ "n_chars": 1630297
3319
+ },
3320
+ "aya_101.cc100-fa": {
3321
+ "vocab_size": 250100,
3322
+ "n_bytes": 2054052,
3323
+ "n_tokens": 429922,
3324
+ "n_chars": 1145876
3325
+ },
3326
+ "baichuan.cc100-fa": {
3327
+ "vocab_size": 64000,
3328
+ "n_bytes": 2054052,
3329
+ "n_tokens": 1142057,
3330
+ "n_chars": 1145876
3331
+ },
3332
+ "baichuan2.cc100-fa": {
3333
+ "vocab_size": 125696,
3334
+ "n_bytes": 2054052,
3335
+ "n_tokens": 1052077,
3336
+ "n_chars": 1145876
3337
+ },
3338
+ "bert_base_cased.cc100-fa": {
3339
+ "vocab_size": 28996,
3340
+ "n_bytes": 2054052,
3341
+ "n_tokens": 903078,
3342
+ "n_chars": 1145876
3343
+ },
3344
+ "bert_base_chinese.cc100-fa": {
3345
+ "vocab_size": 21128,
3346
+ "n_bytes": 2054052,
3347
+ "n_tokens": 396414,
3348
+ "n_chars": 1145876
3349
+ },
3350
+ "bert_base_uncased.cc100-fa": {
3351
+ "vocab_size": 30522,
3352
+ "n_bytes": 2054052,
3353
+ "n_tokens": 910783,
3354
+ "n_chars": 1145876
3355
+ },
3356
+ "bloom.cc100-fa": {
3357
+ "vocab_size": 250680,
3358
+ "n_bytes": 2054052,
3359
+ "n_tokens": 434406,
3360
+ "n_chars": 1145876
3361
+ },
3362
+ "byt5_small.cc100-fa": {
3363
+ "vocab_size": 384,
3364
+ "n_bytes": 2054052,
3365
+ "n_tokens": 2064052,
3366
+ "n_chars": 1145876
3367
+ },
3368
+ "character_glm_6b.cc100-fa": {
3369
+ "vocab_size": 64789,
3370
+ "n_bytes": 2054052,
3371
+ "n_tokens": 1165051,
3372
+ "n_chars": 1145876
3373
+ },
3374
+ "chatglm2_6b.cc100-fa": {
3375
+ "vocab_size": 64787,
3376
+ "n_bytes": 2054052,
3377
+ "n_tokens": 1165051,
3378
+ "n_chars": 1145876
3379
+ },
3380
+ "chatglm3_6b.cc100-fa": {
3381
+ "vocab_size": 64796,
3382
+ "n_bytes": 2054052,
3383
+ "n_tokens": 1165051,
3384
+ "n_chars": 1145876
3385
+ },
3386
+ "chatglm_6b.cc100-fa": {
3387
+ "vocab_size": 150344,
3388
+ "n_bytes": 2054052,
3389
+ "n_tokens": 910808,
3390
+ "n_chars": 1145876
3391
+ },
3392
+ "chatyuan_large_v2.cc100-fa": {
3393
+ "vocab_size": 32128,
3394
+ "n_bytes": 2054052,
3395
+ "n_tokens": 740377,
3396
+ "n_chars": 1145876
3397
+ },
3398
+ "chinese_llama.cc100-fa": {
3399
+ "vocab_size": 49953,
3400
+ "n_bytes": 2054052,
3401
+ "n_tokens": 1150750,
3402
+ "n_chars": 1145876
3403
+ },
3404
+ "chinese_llama2.cc100-fa": {
3405
+ "vocab_size": 55296,
3406
+ "n_bytes": 2054052,
3407
+ "n_tokens": 1155078,
3408
+ "n_chars": 1145876
3409
+ },
3410
+ "code_davinci_002.cc100-fa": {
3411
+ "vocab_size": 50281,
3412
+ "n_bytes": 2054052,
3413
+ "n_tokens": 1292300,
3414
+ "n_chars": 1145876
3415
+ },
3416
+ "crystal_coder.cc100-fa": {
3417
+ "vocab_size": 32022,
3418
+ "n_bytes": 2054052,
3419
+ "n_tokens": 1145076,
3420
+ "n_chars": 1145876
3421
+ },
3422
+ "dbrx_instruct.cc100-fa": {
3423
+ "vocab_size": 100280,
3424
+ "n_bytes": 2054052,
3425
+ "n_tokens": 818067,
3426
+ "n_chars": 1145876
3427
+ },
3428
+ "deepseek_coder_33b_instruct.cc100-fa": {
3429
+ "vocab_size": 32022,
3430
+ "n_bytes": 2054052,
3431
+ "n_tokens": 1326109,
3432
+ "n_chars": 1145876
3433
+ },
3434
+ "deepseek_llm_7b_base.cc100-fa": {
3435
+ "vocab_size": 100015,
3436
+ "n_bytes": 2054052,
3437
+ "n_tokens": 973451,
3438
+ "n_chars": 1145876
3439
+ },
3440
+ "falcon_180b.cc100-fa": {
3441
+ "vocab_size": 65024,
3442
+ "n_bytes": 2054052,
3443
+ "n_tokens": 1246580,
3444
+ "n_chars": 1145876
3445
+ },
3446
+ "falcon_7b.cc100-fa": {
3447
+ "vocab_size": 65024,
3448
+ "n_bytes": 2054052,
3449
+ "n_tokens": 1246580,
3450
+ "n_chars": 1145876
3451
+ },
3452
+ "fastchat_t5_3b.cc100-fa": {
3453
+ "vocab_size": 32110,
3454
+ "n_bytes": 2054052,
3455
+ "n_tokens": 712443,
3456
+ "n_chars": 1145876
3457
+ },
3458
+ "flan_t5_base.cc100-fa": {
3459
+ "vocab_size": 32100,
3460
+ "n_bytes": 2054052,
3461
+ "n_tokens": 493779,
3462
+ "n_chars": 1145876
3463
+ },
3464
+ "gemma_7b.cc100-fa": {
3465
+ "vocab_size": 256000,
3466
+ "n_bytes": 2054052,
3467
+ "n_tokens": 373762,
3468
+ "n_chars": 1145876
3469
+ },
3470
+ "gpt2.cc100-fa": {
3471
+ "vocab_size": 50257,
3472
+ "n_bytes": 2054052,
3473
+ "n_tokens": 1292300,
3474
+ "n_chars": 1145876
3475
+ },
3476
+ "gpt2_chinese.cc100-fa": {
3477
+ "vocab_size": 21128,
3478
+ "n_bytes": 2054052,
3479
+ "n_tokens": 406174,
3480
+ "n_chars": 1145876
3481
+ },
3482
+ "gpt_35_turbo.cc100-fa": {
3483
+ "vocab_size": 100277,
3484
+ "n_bytes": 2054052,
3485
+ "n_tokens": 818067,
3486
+ "n_chars": 1145876
3487
+ },
3488
+ "gpt_4.cc100-fa": {
3489
+ "vocab_size": 100277,
3490
+ "n_bytes": 2054052,
3491
+ "n_tokens": 818067,
3492
+ "n_chars": 1145876
3493
+ },
3494
+ "gpt_neox_japanese_2_7b.cc100-fa": {
3495
+ "vocab_size": 32000,
3496
+ "n_bytes": 2054052,
3497
+ "n_tokens": 2036715,
3498
+ "n_chars": 1145876
3499
+ },
3500
+ "gpt_nexo_20b.cc100-fa": {
3501
+ "vocab_size": 50277,
3502
+ "n_bytes": 2054052,
3503
+ "n_tokens": 866434,
3504
+ "n_chars": 1145876
3505
+ },
3506
+ "grok_1.cc100-fa": {
3507
+ "vocab_size": 131072,
3508
+ "n_bytes": 2054052,
3509
+ "n_tokens": 1073281,
3510
+ "n_chars": 1145876
3511
+ },
3512
+ "internlm2_chat_7b.cc100-fa": {
3513
+ "vocab_size": 92544,
3514
+ "n_bytes": 2054052,
3515
+ "n_tokens": 1195032,
3516
+ "n_chars": 1145876
3517
+ },
3518
+ "internlm2_math_7b.cc100-fa": {
3519
+ "vocab_size": 92544,
3520
+ "n_bytes": 2054052,
3521
+ "n_tokens": 1195032,
3522
+ "n_chars": 1145876
3523
+ },
3524
+ "internlm_chat_7b.cc100-fa": {
3525
+ "vocab_size": 103168,
3526
+ "n_bytes": 2054052,
3527
+ "n_tokens": 640945,
3528
+ "n_chars": 1145876
3529
+ },
3530
+ "internlm_xcomposer_7b.cc100-fa": {
3531
+ "vocab_size": 103168,
3532
+ "n_bytes": 2054052,
3533
+ "n_tokens": 640945,
3534
+ "n_chars": 1145876
3535
+ },
3536
+ "jamba_v0_1.cc100-fa": {
3537
+ "vocab_size": 65536,
3538
+ "n_bytes": 2054052,
3539
+ "n_tokens": 732550,
3540
+ "n_chars": 1145876
3541
+ },
3542
+ "kplug.cc100-fa": {
3543
+ "vocab_size": 10261,
3544
+ "n_bytes": 2054052,
3545
+ "n_tokens": 274671,
3546
+ "n_chars": 1145876
3547
+ },
3548
+ "llama.cc100-fa": {
3549
+ "vocab_size": 32000,
3550
+ "n_bytes": 2054052,
3551
+ "n_tokens": 1155076,
3552
+ "n_chars": 1145876
3553
+ },
3554
+ "llama2.cc100-fa": {
3555
+ "vocab_size": 32001,
3556
+ "n_bytes": 2054052,
3557
+ "n_tokens": 1155076,
3558
+ "n_chars": 1145876
3559
+ },
3560
+ "llama3.cc100-fa": {
3561
+ "vocab_size": 128256,
3562
+ "n_bytes": 2054052,
3563
+ "n_tokens": 387448,
3564
+ "n_chars": 1145876
3565
+ },
3566
+ "llama_3_chinese_8b.cc100-fa": {
3567
+ "vocab_size": 128256,
3568
+ "n_bytes": 2054052,
3569
+ "n_tokens": 397448,
3570
+ "n_chars": 1145876
3571
+ },
3572
+ "mistral_7b.cc100-fa": {
3573
+ "vocab_size": 32000,
3574
+ "n_bytes": 2054052,
3575
+ "n_tokens": 1133278,
3576
+ "n_chars": 1145876
3577
+ },
3578
+ "mixtral_8_7b.cc100-fa": {
3579
+ "vocab_size": 32000,
3580
+ "n_bytes": 2054052,
3581
+ "n_tokens": 1133278,
3582
+ "n_chars": 1145876
3583
+ },
3584
+ "mobilebert_uncased.cc100-fa": {
3585
+ "vocab_size": 30522,
3586
+ "n_bytes": 2054052,
3587
+ "n_tokens": 910783,
3588
+ "n_chars": 1145876
3589
+ },
3590
+ "moss.cc100-fa": {
3591
+ "vocab_size": 106072,
3592
+ "n_bytes": 2054052,
3593
+ "n_tokens": 1285426,
3594
+ "n_chars": 1145876
3595
+ },
3596
+ "mt5_large.cc100-fa": {
3597
+ "vocab_size": 250100,
3598
+ "n_bytes": 2054052,
3599
+ "n_tokens": 429922,
3600
+ "n_chars": 1145876
3601
+ },
3602
+ "olmo_7b.cc100-fa": {
3603
+ "vocab_size": 50280,
3604
+ "n_bytes": 2054052,
3605
+ "n_tokens": 866434,
3606
+ "n_chars": 1145876
3607
+ },
3608
+ "orion_14b_chat.cc100-fa": {
3609
+ "vocab_size": 84608,
3610
+ "n_bytes": 2054052,
3611
+ "n_tokens": 1131108,
3612
+ "n_chars": 1145876
3613
+ },
3614
+ "phi_1.cc100-fa": {
3615
+ "vocab_size": 50295,
3616
+ "n_bytes": 2054052,
3617
+ "n_tokens": 1292300,
3618
+ "n_chars": 1145876
3619
+ },
3620
+ "phi_2.cc100-fa": {
3621
+ "vocab_size": 50295,
3622
+ "n_bytes": 2054052,
3623
+ "n_tokens": 1292300,
3624
+ "n_chars": 1145876
3625
+ },
3626
+ "phi_3_mini.cc100-fa": {
3627
+ "vocab_size": 32011,
3628
+ "n_bytes": 2054052,
3629
+ "n_tokens": 1155076,
3630
+ "n_chars": 1145876
3631
+ },
3632
+ "pko_t5_large.cc100-fa": {
3633
+ "vocab_size": 50358,
3634
+ "n_bytes": 2054052,
3635
+ "n_tokens": 2061040,
3636
+ "n_chars": 1145876
3637
+ },
3638
+ "prompt_clue.cc100-fa": {
3639
+ "vocab_size": 32128,
3640
+ "n_bytes": 2054052,
3641
+ "n_tokens": 740377,
3642
+ "n_chars": 1145876
3643
+ },
3644
+ "qwen1_5_14b_chat.cc100-fa": {
3645
+ "vocab_size": 151646,
3646
+ "n_bytes": 2054052,
3647
+ "n_tokens": 643421,
3648
+ "n_chars": 1145876
3649
+ },
3650
+ "qwen_1_8b_chat.cc100-fa": {
3651
+ "vocab_size": 151851,
3652
+ "n_bytes": 2054052,
3653
+ "n_tokens": 643421,
3654
+ "n_chars": 1145876
3655
+ },
3656
+ "qwen_72b_chat.cc100-fa": {
3657
+ "vocab_size": 151851,
3658
+ "n_bytes": 2054052,
3659
+ "n_tokens": 643421,
3660
+ "n_chars": 1145876
3661
+ },
3662
+ "qwen_7b_chat.cc100-fa": {
3663
+ "vocab_size": 151851,
3664
+ "n_bytes": 2054052,
3665
+ "n_tokens": 643421,
3666
+ "n_chars": 1145876
3667
+ },
3668
+ "roberta_chinese_clue.cc100-fa": {
3669
+ "vocab_size": 8021,
3670
+ "n_bytes": 2054052,
3671
+ "n_tokens": 407763,
3672
+ "n_chars": 1145876
3673
+ },
3674
+ "skywork_13b_base.cc100-fa": {
3675
+ "vocab_size": 65519,
3676
+ "n_bytes": 2054052,
3677
+ "n_tokens": 1155072,
3678
+ "n_chars": 1145876
3679
+ },
3680
+ "skywork_13b_math.cc100-fa": {
3681
+ "vocab_size": 65519,
3682
+ "n_bytes": 2054052,
3683
+ "n_tokens": 1155072,
3684
+ "n_chars": 1145876
3685
+ },
3686
+ "solar_10_7b.cc100-fa": {
3687
+ "vocab_size": 32000,
3688
+ "n_bytes": 2054052,
3689
+ "n_tokens": 1133278,
3690
+ "n_chars": 1145876
3691
+ },
3692
+ "starchat_alpha.cc100-fa": {
3693
+ "vocab_size": 49156,
3694
+ "n_bytes": 2054052,
3695
+ "n_tokens": 851630,
3696
+ "n_chars": 1145876
3697
+ },
3698
+ "switch_c_2048.cc100-fa": {
3699
+ "vocab_size": 32100,
3700
+ "n_bytes": 2054052,
3701
+ "n_tokens": 493767,
3702
+ "n_chars": 1145876
3703
+ },
3704
+ "t5_base.cc100-fa": {
3705
+ "vocab_size": 32100,
3706
+ "n_bytes": 2054052,
3707
+ "n_tokens": 493767,
3708
+ "n_chars": 1145876
3709
+ },
3710
+ "t5_large.cc100-fa": {
3711
+ "vocab_size": 32100,
3712
+ "n_bytes": 2054052,
3713
+ "n_tokens": 493767,
3714
+ "n_chars": 1145876
3715
+ },
3716
+ "t5_small.cc100-fa": {
3717
+ "vocab_size": 32100,
3718
+ "n_bytes": 2054052,
3719
+ "n_tokens": 493767,
3720
+ "n_chars": 1145876
3721
+ },
3722
+ "text_davinci_003.cc100-fa": {
3723
+ "vocab_size": 50281,
3724
+ "n_bytes": 2054052,
3725
+ "n_tokens": 1292300,
3726
+ "n_chars": 1145876
3727
+ },
3728
+ "tigerbot_13b_chat_v2.cc100-fa": {
3729
+ "vocab_size": 60515,
3730
+ "n_bytes": 2054052,
3731
+ "n_tokens": 1145046,
3732
+ "n_chars": 1145876
3733
+ },
3734
+ "tigerbot_70b_chat_v4_4k.cc100-fa": {
3735
+ "vocab_size": 65110,
3736
+ "n_bytes": 2054052,
3737
+ "n_tokens": 1145048,
3738
+ "n_chars": 1145876
3739
+ },
3740
+ "wizardcoder_15b_v1.cc100-fa": {
3741
+ "vocab_size": 49153,
3742
+ "n_bytes": 2054052,
3743
+ "n_tokens": 851630,
3744
+ "n_chars": 1145876
3745
+ },
3746
+ "wizardcoder_python_7b_v1.cc100-fa": {
3747
+ "vocab_size": 32001,
3748
+ "n_bytes": 2054052,
3749
+ "n_tokens": 1155076,
3750
+ "n_chars": 1145876
3751
+ },
3752
+ "wizardlm_7b_v1.cc100-fa": {
3753
+ "vocab_size": 32001,
3754
+ "n_bytes": 2054052,
3755
+ "n_tokens": 1155076,
3756
+ "n_chars": 1145876
3757
+ },
3758
+ "wizardmath_70b_v1.cc100-fa": {
3759
+ "vocab_size": 32002,
3760
+ "n_bytes": 2054052,
3761
+ "n_tokens": 1155076,
3762
+ "n_chars": 1145876
3763
+ },
3764
+ "xlm_roberta.cc100-fa": {
3765
+ "vocab_size": 250002,
3766
+ "n_bytes": 2054052,
3767
+ "n_tokens": 330926,
3768
+ "n_chars": 1145876
3769
+ },
3770
+ "yi_34b.cc100-fa": {
3771
+ "vocab_size": 64000,
3772
+ "n_bytes": 2054052,
3773
+ "n_tokens": 1337264,
3774
+ "n_chars": 1145876
3775
+ },
3776
+ "yi_6b.cc100-fa": {
3777
+ "vocab_size": 64000,
3778
+ "n_bytes": 2054052,
3779
+ "n_tokens": 1337264,
3780
+ "n_chars": 1145876
3781
+ },
3782
+ "yi_vl34b.cc100-fa": {
3783
+ "vocab_size": 64000,
3784
+ "n_bytes": 2054052,
3785
+ "n_tokens": 1346819,
3786
+ "n_chars": 1145876
3787
+ },
3788
+ "zephyr_7b_beta.cc100-fa": {
3789
+ "vocab_size": 32000,
3790
+ "n_bytes": 2054052,
3791
+ "n_tokens": 1133278,
3792
+ "n_chars": 1145876
3793
+ },
3794
+ "llama_3_chinese_8b.cc100-fr": {
3795
+ "vocab_size": 128256,
3796
+ "n_bytes": 1540504,
3797
+ "n_tokens": 422146,
3798
+ "n_chars": 1484970
3799
+ },
3800
+ "llama_3_chinese_8b.cc100-ja": {
3801
+ "vocab_size": 128256,
3802
+ "n_bytes": 1774770,
3803
+ "n_tokens": 424715,
3804
+ "n_chars": 603065
3805
+ },
3806
+ "aya_101.cc100-ko": {
3807
+ "vocab_size": 250100,
3808
+ "n_bytes": 1524839,
3809
+ "n_tokens": 434586,
3810
+ "n_chars": 655190
3811
+ },
3812
+ "baichuan.cc100-ko": {
3813
+ "vocab_size": 64000,
3814
+ "n_bytes": 1524839,
3815
+ "n_tokens": 639258,
3816
+ "n_chars": 655190
3817
+ },
3818
+ "baichuan2.cc100-ko": {
3819
+ "vocab_size": 125696,
3820
+ "n_bytes": 1524839,
3821
+ "n_tokens": 623358,
3822
+ "n_chars": 655190
3823
+ },
3824
+ "bert_base_cased.cc100-ko": {
3825
+ "vocab_size": 28996,
3826
+ "n_bytes": 1524839,
3827
+ "n_tokens": 222828,
3828
+ "n_chars": 655190
3829
+ },
3830
+ "bert_base_chinese.cc100-ko": {
3831
+ "vocab_size": 21128,
3832
+ "n_bytes": 1524839,
3833
+ "n_tokens": 219752,
3834
+ "n_chars": 655190
3835
+ },
3836
+ "bert_base_uncased.cc100-ko": {
3837
+ "vocab_size": 30522,
3838
+ "n_bytes": 1524839,
3839
+ "n_tokens": 904756,
3840
+ "n_chars": 655190
3841
+ },
3842
+ "bloom.cc100-ko": {
3843
+ "vocab_size": 250680,
3844
+ "n_bytes": 1524839,
3845
+ "n_tokens": 742111,
3846
+ "n_chars": 655190
3847
+ },
3848
+ "byt5_small.cc100-ko": {
3849
+ "vocab_size": 384,
3850
+ "n_bytes": 1524839,
3851
+ "n_tokens": 1534839,
3852
+ "n_chars": 655190
3853
+ },
3854
+ "character_glm_6b.cc100-ko": {
3855
+ "vocab_size": 64789,
3856
+ "n_bytes": 1524839,
3857
+ "n_tokens": 672160,
3858
+ "n_chars": 655190
3859
+ },
3860
+ "chatglm2_6b.cc100-ko": {
3861
+ "vocab_size": 64787,
3862
+ "n_bytes": 1524839,
3863
+ "n_tokens": 672156,
3864
+ "n_chars": 655190
3865
+ },
3866
+ "chatglm3_6b.cc100-ko": {
3867
+ "vocab_size": 64796,
3868
+ "n_bytes": 1524839,
3869
+ "n_tokens": 672160,
3870
+ "n_chars": 655190
3871
+ },
3872
+ "chatglm_6b.cc100-ko": {
3873
+ "vocab_size": 150344,
3874
+ "n_bytes": 1524839,
3875
+ "n_tokens": 939630,
3876
+ "n_chars": 655190
3877
+ },
3878
+ "chatyuan_large_v2.cc100-ko": {
3879
+ "vocab_size": 32128,
3880
+ "n_bytes": 1524839,
3881
+ "n_tokens": 354411,
3882
+ "n_chars": 655190
3883
+ },
3884
+ "chinese_llama.cc100-ko": {
3885
+ "vocab_size": 49953,
3886
+ "n_bytes": 1524839,
3887
+ "n_tokens": 913553,
3888
+ "n_chars": 655190
3889
+ },
3890
+ "chinese_llama2.cc100-ko": {
3891
+ "vocab_size": 55296,
3892
+ "n_bytes": 1524839,
3893
+ "n_tokens": 963427,
3894
+ "n_chars": 655190
3895
+ },
3896
+ "code_davinci_002.cc100-ko": {
3897
+ "vocab_size": 50281,
3898
+ "n_bytes": 1524839,
3899
+ "n_tokens": 1308993,
3900
+ "n_chars": 655190
3901
+ },
3902
+ "crystal_coder.cc100-ko": {
3903
+ "vocab_size": 32022,
3904
+ "n_bytes": 1524839,
3905
+ "n_tokens": 954428,
3906
+ "n_chars": 655190
3907
+ },
3908
+ "dbrx_instruct.cc100-ko": {
3909
+ "vocab_size": 100280,
3910
+ "n_bytes": 1524839,
3911
+ "n_tokens": 652277,
3912
+ "n_chars": 655190
3913
+ },
3914
+ "deepseek_coder_33b_instruct.cc100-ko": {
3915
+ "vocab_size": 32022,
3916
+ "n_bytes": 1524839,
3917
+ "n_tokens": 1454805,
3918
+ "n_chars": 655190
3919
+ },
3920
+ "deepseek_llm_7b_base.cc100-ko": {
3921
+ "vocab_size": 100015,
3922
+ "n_bytes": 1524839,
3923
+ "n_tokens": 1081983,
3924
+ "n_chars": 655190
3925
+ },
3926
+ "falcon_180b.cc100-ko": {
3927
+ "vocab_size": 65024,
3928
+ "n_bytes": 1524839,
3929
+ "n_tokens": 1330568,
3930
+ "n_chars": 655190
3931
+ },
3932
+ "falcon_7b.cc100-ko": {
3933
+ "vocab_size": 65024,
3934
+ "n_bytes": 1524839,
3935
+ "n_tokens": 1330568,
3936
+ "n_chars": 655190
3937
+ },
3938
+ "fastchat_t5_3b.cc100-ko": {
3939
+ "vocab_size": 32110,
3940
+ "n_bytes": 1524839,
3941
+ "n_tokens": 484953,
3942
+ "n_chars": 655190
3943
+ },
3944
+ "flan_t5_base.cc100-ko": {
3945
+ "vocab_size": 32100,
3946
+ "n_bytes": 1524839,
3947
+ "n_tokens": 344457,
3948
+ "n_chars": 655190
3949
+ },
3950
+ "gemma_7b.cc100-ko": {
3951
+ "vocab_size": 256000,
3952
+ "n_bytes": 1524839,
3953
+ "n_tokens": 464410,
3954
+ "n_chars": 655190
3955
+ },
3956
+ "gpt2.cc100-ko": {
3957
+ "vocab_size": 50257,
3958
+ "n_bytes": 1524839,
3959
+ "n_tokens": 1309029,
3960
+ "n_chars": 655190
3961
+ },
3962
+ "gpt2_chinese.cc100-ko": {
3963
+ "vocab_size": 21128,
3964
+ "n_bytes": 1524839,
3965
+ "n_tokens": 1055974,
3966
+ "n_chars": 655190
3967
+ },
3968
+ "gpt_35_turbo.cc100-ko": {
3969
+ "vocab_size": 100277,
3970
+ "n_bytes": 1524839,
3971
+ "n_tokens": 652277,
3972
+ "n_chars": 655190
3973
+ },
3974
+ "gpt_4.cc100-ko": {
3975
+ "vocab_size": 100277,
3976
+ "n_bytes": 1524839,
3977
+ "n_tokens": 652277,
3978
+ "n_chars": 655190
3979
+ },
3980
+ "gpt_neox_japanese_2_7b.cc100-ko": {
3981
+ "vocab_size": 32000,
3982
+ "n_bytes": 1524839,
3983
+ "n_tokens": 1512832,
3984
+ "n_chars": 655190
3985
+ },
3986
+ "gpt_nexo_20b.cc100-ko": {
3987
+ "vocab_size": 50277,
3988
+ "n_bytes": 1524839,
3989
+ "n_tokens": 973288,
3990
+ "n_chars": 655190
3991
+ },
3992
+ "grok_1.cc100-ko": {
3993
+ "vocab_size": 131072,
3994
+ "n_bytes": 1524839,
3995
+ "n_tokens": 1152005,
3996
+ "n_chars": 655190
3997
+ },
3998
+ "internlm2_chat_7b.cc100-ko": {
3999
+ "vocab_size": 92544,
4000
+ "n_bytes": 1524839,
4001
+ "n_tokens": 1008524,
4002
+ "n_chars": 655190
4003
+ },
4004
+ "internlm2_math_7b.cc100-ko": {
4005
+ "vocab_size": 92544,
4006
+ "n_bytes": 1524839,
4007
+ "n_tokens": 1008524,
4008
+ "n_chars": 655190
4009
+ },
4010
+ "internlm_chat_7b.cc100-ko": {
4011
+ "vocab_size": 103168,
4012
+ "n_bytes": 1524839,
4013
+ "n_tokens": 839609,
4014
+ "n_chars": 655190
4015
+ },
4016
+ "internlm_xcomposer_7b.cc100-ko": {
4017
+ "vocab_size": 103168,
4018
+ "n_bytes": 1524839,
4019
+ "n_tokens": 839609,
4020
+ "n_chars": 655190
4021
+ },
4022
+ "jamba_v0_1.cc100-ko": {
4023
+ "vocab_size": 65536,
4024
+ "n_bytes": 1524839,
4025
+ "n_tokens": 715688,
4026
+ "n_chars": 655190
4027
+ },
4028
+ "kplug.cc100-ko": {
4029
+ "vocab_size": 10261,
4030
+ "n_bytes": 1524839,
4031
+ "n_tokens": 222771,
4032
+ "n_chars": 655190
4033
+ },
4034
+ "llama.cc100-ko": {
4035
+ "vocab_size": 32000,
4036
+ "n_bytes": 1524839,
4037
+ "n_tokens": 964428,
4038
+ "n_chars": 655190
4039
+ },
4040
+ "llama2.cc100-ko": {
4041
+ "vocab_size": 32001,
4042
+ "n_bytes": 1524839,
4043
+ "n_tokens": 964428,
4044
+ "n_chars": 655190
4045
+ },
4046
+ "llama3.cc100-ko": {
4047
+ "vocab_size": 128256,
4048
+ "n_bytes": 1524839,
4049
+ "n_tokens": 412595,
4050
+ "n_chars": 655190
4051
+ },
4052
+ "llama_3_chinese_8b.cc100-ko": {
4053
+ "vocab_size": 128256,
4054
+ "n_bytes": 1524839,
4055
+ "n_tokens": 422595,
4056
+ "n_chars": 655190
4057
+ },
4058
+ "mistral_7b.cc100-ko": {
4059
+ "vocab_size": 32000,
4060
+ "n_bytes": 1524839,
4061
+ "n_tokens": 728766,
4062
+ "n_chars": 655190
4063
+ },
4064
+ "mixtral_8_7b.cc100-ko": {
4065
+ "vocab_size": 32000,
4066
+ "n_bytes": 1524839,
4067
+ "n_tokens": 728766,
4068
+ "n_chars": 655190
4069
+ },
4070
+ "mobilebert_uncased.cc100-ko": {
4071
+ "vocab_size": 30522,
4072
+ "n_bytes": 1524839,
4073
+ "n_tokens": 904756,
4074
+ "n_chars": 655190
4075
+ },
4076
+ "moss.cc100-ko": {
4077
+ "vocab_size": 106072,
4078
+ "n_bytes": 1524839,
4079
+ "n_tokens": 1305249,
4080
+ "n_chars": 655190
4081
+ },
4082
+ "mt5_large.cc100-ko": {
4083
+ "vocab_size": 250100,
4084
+ "n_bytes": 1524839,
4085
+ "n_tokens": 434586,
4086
+ "n_chars": 655190
4087
+ },
4088
+ "olmo_7b.cc100-ko": {
4089
+ "vocab_size": 50280,
4090
+ "n_bytes": 1524839,
4091
+ "n_tokens": 973288,
4092
+ "n_chars": 655190
4093
+ },
4094
+ "orion_14b_chat.cc100-ko": {
4095
+ "vocab_size": 84608,
4096
+ "n_bytes": 1524839,
4097
+ "n_tokens": 351149,
4098
+ "n_chars": 655190
4099
+ },
4100
+ "phi_1.cc100-ko": {
4101
+ "vocab_size": 50295,
4102
+ "n_bytes": 1524839,
4103
+ "n_tokens": 1308988,
4104
+ "n_chars": 655190
4105
+ },
4106
+ "phi_2.cc100-ko": {
4107
+ "vocab_size": 50295,
4108
+ "n_bytes": 1524839,
4109
+ "n_tokens": 1308988,
4110
+ "n_chars": 655190
4111
+ },
4112
+ "phi_3_mini.cc100-ko": {
4113
+ "vocab_size": 32011,
4114
+ "n_bytes": 1524839,
4115
+ "n_tokens": 964428,
4116
+ "n_chars": 655190
4117
+ },
4118
+ "pko_t5_large.cc100-ko": {
4119
+ "vocab_size": 50358,
4120
+ "n_bytes": 1524839,
4121
+ "n_tokens": 471643,
4122
+ "n_chars": 655190
4123
+ },
4124
+ "prompt_clue.cc100-ko": {
4125
+ "vocab_size": 32128,
4126
+ "n_bytes": 1524839,
4127
+ "n_tokens": 354411,
4128
+ "n_chars": 655190
4129
+ },
4130
+ "qwen1_5_14b_chat.cc100-ko": {
4131
+ "vocab_size": 151646,
4132
+ "n_bytes": 1524839,
4133
+ "n_tokens": 457492,
4134
+ "n_chars": 655190
4135
+ },
4136
+ "qwen_1_8b_chat.cc100-ko": {
4137
+ "vocab_size": 151851,
4138
+ "n_bytes": 1524839,
4139
+ "n_tokens": 457492,
4140
+ "n_chars": 655190
4141
+ },
4142
+ "qwen_72b_chat.cc100-ko": {
4143
+ "vocab_size": 151851,
4144
+ "n_bytes": 1524839,
4145
+ "n_tokens": 457492,
4146
+ "n_chars": 655190
4147
+ },
4148
+ "qwen_7b_chat.cc100-ko": {
4149
+ "vocab_size": 151851,
4150
+ "n_bytes": 1524839,
4151
+ "n_tokens": 457492,
4152
+ "n_chars": 655190
4153
+ },
4154
+ "roberta_chinese_clue.cc100-ko": {
4155
+ "vocab_size": 8021,
4156
+ "n_bytes": 1524839,
4157
+ "n_tokens": 226812,
4158
+ "n_chars": 655190
4159
+ },
4160
+ "skywork_13b_base.cc100-ko": {
4161
+ "vocab_size": 65519,
4162
+ "n_bytes": 1524839,
4163
+ "n_tokens": 962744,
4164
+ "n_chars": 655190
4165
+ },
4166
+ "skywork_13b_math.cc100-ko": {
4167
+ "vocab_size": 65519,
4168
+ "n_bytes": 1524839,
4169
+ "n_tokens": 962744,
4170
+ "n_chars": 655190
4171
+ },
4172
+ "solar_10_7b.cc100-ko": {
4173
+ "vocab_size": 32000,
4174
+ "n_bytes": 1524839,
4175
+ "n_tokens": 728766,
4176
+ "n_chars": 655190
4177
+ },
4178
+ "starchat_alpha.cc100-ko": {
4179
+ "vocab_size": 49156,
4180
+ "n_bytes": 1524839,
4181
+ "n_tokens": 580873,
4182
+ "n_chars": 655190
4183
+ },
4184
+ "switch_c_2048.cc100-ko": {
4185
+ "vocab_size": 32100,
4186
+ "n_bytes": 1524839,
4187
+ "n_tokens": 344457,
4188
+ "n_chars": 655190
4189
+ },
4190
+ "t5_base.cc100-ko": {
4191
+ "vocab_size": 32100,
4192
+ "n_bytes": 1524839,
4193
+ "n_tokens": 344457,
4194
+ "n_chars": 655190
4195
+ },
4196
+ "t5_large.cc100-ko": {
4197
+ "vocab_size": 32100,
4198
+ "n_bytes": 1524839,
4199
+ "n_tokens": 344457,
4200
+ "n_chars": 655190
4201
+ },
4202
+ "t5_small.cc100-ko": {
4203
+ "vocab_size": 32100,
4204
+ "n_bytes": 1524839,
4205
+ "n_tokens": 344457,
4206
+ "n_chars": 655190
4207
+ },
4208
+ "text_davinci_003.cc100-ko": {
4209
+ "vocab_size": 50281,
4210
+ "n_bytes": 1524839,
4211
+ "n_tokens": 1308993,
4212
+ "n_chars": 655190
4213
+ },
4214
+ "tigerbot_13b_chat_v2.cc100-ko": {
4215
+ "vocab_size": 60515,
4216
+ "n_bytes": 1524839,
4217
+ "n_tokens": 793053,
4218
+ "n_chars": 655190
4219
+ },
4220
+ "tigerbot_70b_chat_v4_4k.cc100-ko": {
4221
+ "vocab_size": 65110,
4222
+ "n_bytes": 1524839,
4223
+ "n_tokens": 484082,
4224
+ "n_chars": 655190
4225
+ },
4226
+ "wizardcoder_15b_v1.cc100-ko": {
4227
+ "vocab_size": 49153,
4228
+ "n_bytes": 1524839,
4229
+ "n_tokens": 580873,
4230
+ "n_chars": 655190
4231
+ },
4232
+ "wizardcoder_python_7b_v1.cc100-ko": {
4233
+ "vocab_size": 32001,
4234
+ "n_bytes": 1524839,
4235
+ "n_tokens": 964428,
4236
+ "n_chars": 655190
4237
+ },
4238
+ "wizardlm_7b_v1.cc100-ko": {
4239
+ "vocab_size": 32001,
4240
+ "n_bytes": 1524839,
4241
+ "n_tokens": 964428,
4242
+ "n_chars": 655190
4243
+ },
4244
+ "wizardmath_70b_v1.cc100-ko": {
4245
+ "vocab_size": 32002,
4246
+ "n_bytes": 1524839,
4247
+ "n_tokens": 964428,
4248
+ "n_chars": 655190
4249
+ },
4250
+ "xlm_roberta.cc100-ko": {
4251
+ "vocab_size": 250002,
4252
+ "n_bytes": 1524839,
4253
+ "n_tokens": 374571,
4254
+ "n_chars": 655190
4255
+ },
4256
+ "yi_34b.cc100-ko": {
4257
+ "vocab_size": 64000,
4258
+ "n_bytes": 1524839,
4259
+ "n_tokens": 1203134,
4260
+ "n_chars": 655190
4261
+ },
4262
+ "yi_6b.cc100-ko": {
4263
+ "vocab_size": 64000,
4264
+ "n_bytes": 1524839,
4265
+ "n_tokens": 1203134,
4266
+ "n_chars": 655190
4267
+ },
4268
+ "yi_vl34b.cc100-ko": {
4269
+ "vocab_size": 64000,
4270
+ "n_bytes": 1524839,
4271
+ "n_tokens": 1210021,
4272
+ "n_chars": 655190
4273
+ },
4274
+ "zephyr_7b_beta.cc100-ko": {
4275
+ "vocab_size": 32000,
4276
+ "n_bytes": 1524839,
4277
+ "n_tokens": 728766,
4278
+ "n_chars": 655190
4279
+ },
4280
+ "llama_3_chinese_8b.cc100-zh-Hans": {
4281
+ "vocab_size": 128256,
4282
+ "n_bytes": 2633047,
4283
+ "n_tokens": 757405,
4284
+ "n_chars": 927311
4285
  }
4286
  }
utils/compression_util.py CHANGED
@@ -20,7 +20,8 @@ from typing import List, Optional, Union, Literal
20
  CURRENT_DIR = os.path.dirname(os.path.abspath(__file__))
21
 
22
  common_units = ["g_bytes/b_tokens", "b_tokens/g_bytes", "t_bytes/t_tokens", "t_tokens/t_bytes", "n_chars/n_tokens", ]
23
- common_corpuses = ["cc100-en", "cc100-zh-Hans", "cc100-es", "cc100-fr", "cc100-de", "cc100-ko" "cc100-fa", "cc100-ar"]
 
24
 
25
  VALID_CODES_CC100 = [
26
  "am", "ar", "as", "az", "be", "bg", "bn", "bn_rom", "br", "bs", "ca", "cs", "cy", "da", "de",
@@ -198,10 +199,10 @@ def test():
198
 
199
  def main():
200
  if len(sys.argv) == 3:
201
- tokenizers = [sys.argv[1]]
202
  corpuses = [sys.argv[2]]
203
  else:
204
- tokenizers = all_tokenizers[:2]
205
  corpuses = common_corpuses
206
  df = get_compression_leaderboard(corpuses)
207
  # print(df.to_markdown(index=False, tablefmt='fancy_grid'))
 
20
  CURRENT_DIR = os.path.dirname(os.path.abspath(__file__))
21
 
22
  common_units = ["g_bytes/b_tokens", "b_tokens/g_bytes", "t_bytes/t_tokens", "t_tokens/t_bytes", "n_chars/n_tokens", ]
23
+ common_corpuses = sorted(["cc100-en", "cc100-zh-Hans", "cc100-es", "cc100-fr", "cc100-de", "cc100-ko",
24
+ "cc100-fa", "cc100-ar", "cc100-ja"])
25
 
26
  VALID_CODES_CC100 = [
27
  "am", "ar", "as", "az", "be", "bg", "bn", "bn_rom", "br", "bs", "ca", "cs", "cy", "da", "de",
 
199
 
200
  def main():
201
  if len(sys.argv) == 3:
202
+ tokenizer_filter = [sys.argv[1]]
203
  corpuses = [sys.argv[2]]
204
  else:
205
+ tokenizer_filter = None
206
  corpuses = common_corpuses
207
  df = get_compression_leaderboard(corpuses)
208
  # print(df.to_markdown(index=False, tablefmt='fancy_grid'))
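The raw counts stored in each compress-rate entry above (`n_bytes`, `n_tokens`, `n_chars`) are all that is needed to derive the leaderboard units listed in `common_units`. A minimal sketch of that derivation, assuming the entry layout shown above (the helper name `derive_units` is hypothetical and not part of `compression_util.py`):

```python
# Hypothetical helper: derive the leaderboard units from one compress-rate entry.
def derive_units(entry: dict) -> dict:
    n_bytes, n_tokens, n_chars = entry["n_bytes"], entry["n_tokens"], entry["n_chars"]
    return {
        # "billion tokens per gigabyte" reduces to tokens/bytes because both prefixes are 1e9
        "b_tokens/g_bytes": n_tokens / n_bytes,
        "g_bytes/b_tokens": n_bytes / n_tokens,
        # the trillion/terabyte pair reduces the same way (both prefixes are 1e12)
        "t_tokens/t_bytes": n_tokens / n_bytes,
        "t_bytes/t_tokens": n_bytes / n_tokens,
        "n_chars/n_tokens": n_chars / n_tokens,  # average characters per token
    }

# Example with the llama_3_chinese_8b.cc100-zh-Hans entry above:
entry = {"n_bytes": 2633047, "n_tokens": 757405, "n_chars": 927311}
print(derive_units(entry))  # n_chars/n_tokens is roughly 1.22
```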
utils/lang_util_2.py CHANGED
@@ -53,30 +53,6 @@ def is_all_en(text):
53
 
54
 
55
 
56
- # import opencc
57
-
58
- def is_russian():
59
- """ 俄语 """
60
- pass
61
-
62
- def is_french():
63
- """ 法语 """
64
-
65
- def aa():
66
- """
67
- zh-Hans: Chinese (Simplified)
68
- :return:
69
- """
70
- pass
71
-
72
-
73
- def bb():
74
- """
75
- zh-Hant: Chinese (Traditional)
76
- :return:
77
- """
78
-
79
-
80
  ranges = [
81
  {"from": ord(u"\u3300"), "to": ord(u"\u33ff")}, # compatibility ideographs
82
  {"from": ord(u"\ufe30"), "to": ord(u"\ufe4f")}, # compatibility ideographs
 
53
 
54
 
55
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
56
  ranges = [
57
  {"from": ord(u"\u3300"), "to": ord(u"\u33ff")}, # compatibility ideographs
58
  {"from": ord(u"\ufe30"), "to": ord(u"\ufe4f")}, # compatibility ideographs
vocab/__init__.py CHANGED
@@ -107,6 +107,7 @@ all_tokenizers = [
107
  ("llama3", "", "sentencepiece"),
108
  ("chinese_llama", "", "sentencepiece"), #
109
  ("chinese_llama2", "", "sentencepiece"), #
 
110
  # ("chinese_alpaca_lora_7b", # 中文Alpaca模型在上述中文LLaMA模型的基础上进一步使用了指令数据进行精调。
111
  # ("belle_llama_ext_7b",
112
  # ("alpaca_7b",
@@ -179,7 +180,7 @@ all_tokenizers = [
179
  ("grok_1",),
180
  # ("claude",),
181
  ("gpt_nexo_20b", ),
182
- ("gpt_neox_japanese_2.7b", ),
183
 
184
  ]
185
 
 
107
  ("llama3", "", "sentencepiece"),
108
  ("chinese_llama", "", "sentencepiece"), #
109
  ("chinese_llama2", "", "sentencepiece"), #
110
+ ("llama_3_chinese_8b", "sentencepiece"),
111
  # ("chinese_alpaca_lora_7b", # 中文Alpaca模型在上述中文LLaMA模型的基础上进一步使用了指令数据进行精调。
112
  # ("belle_llama_ext_7b",
113
  # ("alpaca_7b",
 
180
  ("grok_1",),
181
  # ("claude",),
182
  ("gpt_nexo_20b", ),
183
+ ("gpt_neox_japanese_2_7b", ),
184
 
185
  ]
186
 
vocab/gpt_neox_japanese_2_7b/README.md ADDED
@@ -0,0 +1,64 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+
2
+
3
+ ## vocab.txt
4
+
5
+ ```
6
+ るのは
7
+ よね
8
+ 写真,寫真,冩真,写眞,寫眞,冩眞
9
+ マイ
10
+ そん
11
+ 女性,𠨰性,⼥性,女𧢱,𠨰𧢱,⼥𧢱
12
+ 内容,內容,内㣑,内㝐,内彮,内𠕺,內㣑,內㝐,內彮,內𠕺
13
+ ```
14
+
15
+ Why are there so many different written forms of the same word??
16
+
17
+
18
+
19
+
20
+ ## Text normalization
21
+
22
+ The normalization below does not work well for generation tasks.
23
+
24
+ ```
25
+ self.content_repatter1 = re.compile(r"(https?|ftp)(:\/\/[-_\.!~*\'()a-zA-Z0-9;\/?:\@&=\+$,%#]+)")
26
+ self.content_repatter2 = re.compile(r"[A-Za-z0-9\._+]*@[\-_0-9A-Za-z]+(\.[A-Za-z]+)*")
27
+ self.content_repatter3 = re.compile(r"[\(]{0,1}[0-9]{2,4}[\)\-\(]{0,1}[0-9]{2,4}[\)\-]{0,1}[0-9]{3,4}")
28
+ self.content_repatter4 = re.compile(
29
+ r"([12]\d{3}[/\-年])*(0?[1-9]|1[0-2])[/\-月]((0?[1-9]|[12][0-9]|3[01])日?)*(\d{1,2}|:|\d{1,2}時|\d{1,2}分|\(日\)|\(月\)|\(火\)|\(水\)|\(木\)|\(金\)|\(土\)|㈰|㈪|㈫|㈬|㈭|㈮|㈯)*"
30
+ )
31
+ self.content_repatter5 = re.compile(
32
+ r"(明治|大正|昭和|平成|令和|㍾|㍽|㍼|㍻|\u32ff)\d{1,2}年(0?[1-9]|1[0-2])月(0?[1-9]|[12][0-9]|3[01])日(\d{1,2}|:|\d{1,2}時|\d{1,2}分|\(日\)|\(月\)|\(火\)|\(水\)|\(木\)|\(金\)|\(土\)|㈰|㈪|㈫|㈬|㈭|㈮|㈯)*"
33
+ )
34
+ self.content_repatter6 = re.compile(
35
+ r"((0|[1-9]\d*|[1-9]\d{0,2}(,\d{3})+)*億)*((0|[1-9]\d*|[1-9]\d{0,2}(,\d{3})+)*万)*((0|[1-9]\d*|[1-9]\d{0,2}(,\d{3})+)*千)*(0|[1-9]\d*|[1-9]\d{0,2}(,\d{3})+)*(千円|万円|千万円|円|千ドル|万ドル|千万ドル|ドル|千ユーロ|万ユーロ|千万ユーロ|ユーロ)+(\(税込\)|\(税抜\)|\+tax)*"
36
+ )
37
+
38
+ def clean_text(self, content):
39
+ content = self.content_repatter1.sub("<URL>", content)
40
+ content = self.content_repatter2.sub("<EMAIL>", content)
41
+ content = self.content_repatter3.sub("<TEL>", content)
42
+ content = self.content_repatter4.sub("<DATE>", content)
43
+ content = self.content_repatter5.sub("<DATE>", content)
44
+ content = self.content_repatter6.sub("<PRICE>", content)
45
+ content = content.translate(self.content_trans1)
46
+ while "<BLOCK><BLOCK>" in content:
47
+ content = content.replace("<BLOCK><BLOCK>", "<BLOCK>")
48
+ return content
49
+
50
+ def tokenize(self, text, clean=False):
51
+ text = text.replace(" ", "<SP>")
52
+ text = text.replace(" ", "<SP>")
53
+ text = text.replace("\r\n", "<BR>")
54
+ text = text.replace("\n", "<BR>")
55
+ text = text.replace("\r", "<BR>")
56
+ text = text.replace("\t", "<TAB>")
57
+ text = text.replace("—", "ー")
58
+ text = text.replace("−", "ー")
59
+ for k, v in self.emoji["emoji"].items():
60
+ if k in text:
61
+ text = text.replace(k, v)
62
+ if clean:
63
+ text = self.clean_text(text)
64
+ ```
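A plausible reading of the comma-separated entries above is that each `vocab.txt` line groups variant glyph forms of the same word under a single token id, with the first form as the canonical spelling. A hedged sketch of how such a file could be parsed into a variant-to-id lookup (the file name and the interpretation are assumptions drawn from the sample lines, not a verbatim copy of the GPTNeoXJapanese implementation):

```python
# Assumption: each vocab.txt line is "canonical,variant,variant,..." and all
# forms on a line share one token id; the first form is the canonical spelling.
variant_to_id = {}
id_to_canonical = {}
with open("vocab.txt", encoding="utf-8") as f:  # file name is an assumption
    for idx, line in enumerate(f):
        forms = line.rstrip("\n").split(",")
        id_to_canonical[idx] = forms[0]
        for form in forms:
            variant_to_id[form] = idx

# Under this reading, 写真 and 寫真 resolve to the same token id.
```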
vocab/{gpt_neox_japanese_2.7b → gpt_neox_japanese_2_7b}/__init__.py RENAMED
@@ -1,3 +1,7 @@
 
 
 
 
1
  from transformers import AutoTokenizer
2
 
3
  tokenizer = AutoTokenizer.from_pretrained("abeja/gpt-neox-japanese-2.7b")
 
1
+ """
2
+ What is the emoji.json file in this directory for?
3
+ """
4
+
5
  from transformers import AutoTokenizer
6
 
7
  tokenizer = AutoTokenizer.from_pretrained("abeja/gpt-neox-japanese-2.7b")
vocab/gpt_neox_japanese_2_7b/test.py ADDED
@@ -0,0 +1,19 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+
2
+
3
+ from transformers import AutoTokenizer, GPTNeoXJapaneseTokenizer
4
+
5
+ tokenizer = GPTNeoXJapaneseTokenizer.from_pretrained("tokenizer")
6
+
7
+ # tokenizer = AutoTokenizer.from_pretrained("abeja/gpt-neox-japanese-2.7b")
8
+
9
+ tokens = tokenizer.encode("人とAIが協調するためには http://baidu.com 🤣")
10
+
11
+ for token in tokens:
12
+ print(token, tokenizer.decode([token]))
13
+
14
+
15
+ tokens = tokenizer.tokenize("人とAIが協調するためには http://baidu.com 🤣", clean=True)
16
+ print(tokens)
17
+ # for token in tokens:
18
+ # print(token, tokenizer.decode([token]))
19
+
vocab/gpt_neox_japanese_2_7b/test_emoji.py ADDED
@@ -0,0 +1,12 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+
2
+ """
3
+
4
+ ## Questions
5
+ - What is \u200d?
6
+ - The emoji are divided into 12 categories
7
+ """
8
+
9
+ import json
10
+ emoji = json.load(open("tokenizer/emoji.json", "r", encoding="utf-8"))
11
+
12
+ print(emoji)
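To the question above: `\u200d` is the ZERO WIDTH JOINER; it glues individual emoji into one composite glyph, which is why it shows up inside emoji.json entries. A small illustration:

```python
# U+200D (ZERO WIDTH JOINER) combines separate emoji into a single rendered glyph.
family = "\U0001F468\u200d\U0001F469\u200d\U0001F467"  # man + ZWJ + woman + ZWJ + girl
print(len(family))  # 5 codepoints ...
print(family)       # ... but most renderers draw a single family emoji
```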
vocab/llama_3_chinese_8b/__init__.py ADDED
@@ -0,0 +1,5 @@
 
 
 
 
 
 
1
+
2
+
3
+ from transformers import AutoTokenizer
4
+
5
+ tokenizer = AutoTokenizer.from_pretrained("hfl/llama-3-chinese-8b", trust_remote_code=True)