Files changed (1)
  1. README.md +29 -3
README.md CHANGED
@@ -1,3 +1,29 @@
- ---
- license: apache-2.0
- ---
+ ---
+ license: apache-2.0
+ ---
+
+ Upstage `solar-1-mini` tokenizer
+ - Vocab size: 64,000
+ - Language support: English, Korean, Japanese and more
+
+ Use this tokenizer to tokenize inputs for the Upstage [solar-1-mini-chat](https://developers.upstage.ai/docs/apis/chat) model.
+
+ You can load it with the `tokenizers` library like this:
+
+ ```python
+ from tokenizers import Tokenizer
+
+ tokenizer = Tokenizer.from_pretrained("upstage/solar-1-mini-tokenizer")
+ text = "Hi, how are you?"
+ enc = tokenizer.encode(text)
+ print("Encoded input:")
+ print(enc)
+
+ inv_vocab = {v: k for k, v in tokenizer.get_vocab().items()}  # token id -> token string
+ tokens = [inv_vocab[token_id] for token_id in enc.ids]
+ print("Tokens:")
+ print(tokens)
+
+ number_of_tokens = len(enc.ids)
+ print("Number of tokens:", number_of_tokens)
+ ```
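
The README example counts tokens for a single string. As a minimal sketch of extending that to several inputs, the helpers below count tokens per text and check a text against a token limit. The names `count_tokens` and `fits_budget` are hypothetical and not part of the README or the `tokenizers` library. The sketch assumes the object passed in behaves like `tokenizers.Tokenizer`, whose `encode` and `encode_batch` methods return encodings exposing an `ids` list.

```python
from typing import Iterable, List


def count_tokens(tokenizer, texts: Iterable[str]) -> List[int]:
    """Return the number of token ids produced for each input text.

    Assumption: `tokenizer` behaves like `tokenizers.Tokenizer`, so
    `encode_batch` takes a list of strings and returns one encoding
    per string, each with an `ids` list.
    """
    encodings = tokenizer.encode_batch(list(texts))
    return [len(encoding.ids) for encoding in encodings]


def fits_budget(tokenizer, text: str, max_tokens: int) -> bool:
    """Check whether `text` encodes to at most `max_tokens` token ids.

    `max_tokens` is a caller-chosen limit (hypothetical), not a value
    taken from the model documentation.
    """
    return len(tokenizer.encode(text).ids) <= max_tokens
```

With `tokenizer` loaded as in the README snippet, `count_tokens(tokenizer, ["Hi, how are you?", "안녕하세요"])` would return one count per string. The counts include any special tokens the tokenizer adds by default, so they may differ slightly from a count of the visible tokens alone.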