monsoon-nlp committed
Commit 65a4e18
1 Parent(s): 8ff78b2

explain pre-tokenization

Files changed (1):
  1. README.md +120 -0
README.md CHANGED
@@ -6,6 +6,126 @@ language: th
 
 Adapted from https://github.com/ThAIKeras/bert for HuggingFace/Transformers library
 
+ ## Pre-tokenization
+
+ You must run the original ThaiTokenizer to have your tokenization match that of the original model. If you skip this step, you will not do much better than mBERT or random chance!
+
+ ```bash
+ pip install pythainlp six sentencepiece==0.0.9
+ git clone https://github.com/ThAIKeras/bert
+ # download .vocab and .model files from ThAIKeras readme
+ ```
+
+ Then set up the ThaiTokenizer class. This is modified slightly from the original to remove a TensorFlow dependency.
+
+ ```python
+ import collections
+ import unicodedata
+ import six
+
+ def convert_to_unicode(text):
+     """Converts `text` to Unicode (if it's not already), assuming utf-8 input."""
+     if six.PY3:
+         if isinstance(text, str):
+             return text
+         elif isinstance(text, bytes):
+             return text.decode("utf-8", "ignore")
+         else:
+             raise ValueError("Unsupported string type: %s" % (type(text)))
+     elif six.PY2:
+         if isinstance(text, str):
+             return text.decode("utf-8", "ignore")
+         elif isinstance(text, unicode):
+             return text
+         else:
+             raise ValueError("Unsupported string type: %s" % (type(text)))
+     else:
+         raise ValueError("Not running on Python2 or Python 3?")
+
+ def load_vocab(vocab_file):
+     """Loads the vocabulary file into an ordered token -> id mapping."""
+     vocab = collections.OrderedDict()
+     index = 0
+     with open(vocab_file, "r") as reader:
+         while True:
+             token = reader.readline()
+             if token.split(): token = token.split()[0]  # to support SentencePiece vocab file
+             token = convert_to_unicode(token)
+             if not token:
+                 break
+             token = token.strip()
+             vocab[token] = index
+             index += 1
+     return vocab
+
+ #####
+
+ from bert.bpe_helper import BPE  # from the cloned ThAIKeras/bert repo
+ import sentencepiece as spm
+
+ def convert_by_vocab(vocab, items):
+     output = []
+     for item in items:
+         output.append(vocab[item])
+     return output
+
+ class ThaiTokenizer(object):
+     """Tokenizes Thai texts."""
+
+     def __init__(self, vocab_file, spm_file):
+         self.vocab = load_vocab(vocab_file)
+         self.inv_vocab = {v: k for k, v in self.vocab.items()}
+
+         self.bpe = BPE(vocab_file)
+         self.s = spm.SentencePieceProcessor()
+         self.s.Load(spm_file)
+
+     def tokenize(self, text):
+         bpe_tokens = self.bpe.encode(text).split(' ')
+         spm_tokens = self.s.EncodeAsPieces(text)
+
+         # use whichever segmentation produces fewer tokens
+         tokens = bpe_tokens if len(bpe_tokens) < len(spm_tokens) else spm_tokens
+
+         split_tokens = []
+
+         for token in tokens:
+             new_token = token
+
+             if token.startswith('_') and not token in self.vocab:
+                 split_tokens.append('_')
+                 new_token = token[1:]
+
+             if not new_token in self.vocab:
+                 split_tokens.append('<unk>')
+             else:
+                 split_tokens.append(new_token)
+
+         return split_tokens
+
+     def convert_tokens_to_ids(self, tokens):
+         return convert_by_vocab(self.vocab, tokens)
+
+     def convert_ids_to_tokens(self, ids):
+         return convert_by_vocab(self.inv_vocab, ids)
+ ```
+
+ Then pre-tokenize your own text:
+
+ ```python
+ from pythainlp import sent_tokenize
+ tokenizer = ThaiTokenizer(vocab_file='th.wiki.bpe.op25000.vocab', spm_file='th.wiki.bpe.op25000.model')
+
+ og_text = "กรุงเทพมหานคร..."
+ split_sentences = ' '.join(sent_tokenize(og_text))
+ split_words = ' '.join(tokenizer.tokenize(split_sentences))
+
+ split_words
+ > "▁ร้าน อาหาร ใหญ่มาก กก กก กก ▁ <unk> เลี้ยว..."
+ ```
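+
+ Below is a minimal sketch of passing the pre-tokenized `split_words` string on to the model through the Transformers library. This is an illustration rather than part of the original instructions, and the model id used is an assumption; substitute this repository's actual id.
+
+ ```python
+ import torch
+ from transformers import AutoModel, AutoTokenizer
+
+ # Assumed model id, for illustration only
+ MODEL_ID = "monsoon-nlp/bert-base-thai"
+
+ hf_tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
+ model = AutoModel.from_pretrained(MODEL_ID)
+
+ # split_words is the space-joined ThaiTokenizer output from the example above
+ inputs = hf_tokenizer(split_words, return_tensors="pt")
+ with torch.no_grad():
+     outputs = model(**inputs)
+
+ print(outputs.last_hidden_state.shape)  # (1, sequence_length, hidden_size)
+ ```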
+
+ Original README follows:
+
 ---
 
  Google's [**BERT**](https://github.com/google-research/bert) is currently the state-of-the-art method of pre-training text representations which additionally provides multilingual models. ~~Unfortunately, Thai is the only one in 103 languages that is excluded due to difficulties in word segmentation.~~