---
language:
- ja
thumbnail:
tags:
- xlnet
- lm-head
- causal-lm
license:
- apache-2.0
datasets:
- Japanese_Business_News
metrics:
---

# XLNet-japanese

## Model description

This model requires MeCab and SentencePiece together with `XLNetTokenizer`.

See details at https://qiita.com/mkt3/items/4d0ae36f3f212aee8002

This model uses NFKD as its Unicode normalization method, so Japanese dakuten (voiced sound marks) and handakuten (semi-voiced sound marks) are lost.

*This is a model without Japanese dakuten and handakuten.*
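As a concrete illustration, the sketch below uses only Python's standard `unicodedata` module (it is not the model's actual preprocessing code) to show how NFKD decomposition followed by stripping combining marks removes dakuten and handakuten:

```python
import unicodedata

text = "ガギグゲゴ パピプペポ"

# NFKD decomposes voiced kana into a base kana plus a combining
# dakuten/handakuten mark (U+3099 / U+309A).
decomposed = unicodedata.normalize("NFKD", text)

# If combining marks are then dropped, the dakuten/handakuten disappear,
# matching the behaviour described above (illustrative only).
stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(decomposed)  # base kana followed by combining marks
print(stripped)    # カキクケコ ハヒフヘホ
```

In practice this means that, for example, ガ and カ become indistinguishable in the model's input.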
#### How to use

```python
from fugashi import Tagger
from transformers import XLNetLMHeadModel, XLNetTokenizer


class XLNet:
    def __init__(self):
        # MeCab (via fugashi) pre-segments Japanese text into
        # space-separated words before SentencePiece tokenization.
        self.m = Tagger('-Owakati')
        self.gen_model = XLNetLMHeadModel.from_pretrained("hajime9652/xlnet-japanese")
        self.gen_tokenizer = XLNetTokenizer.from_pretrained("hajime9652/xlnet-japanese")

    def generate(self, prompt="福岡のご飯は美味しい。コンパクトで暮らしやすい街。"):
        # Word-segment the prompt with MeCab, then encode it.
        prompt = self.m.parse(prompt)
        inputs = self.gen_tokenizer.encode(prompt, add_special_tokens=False, return_tensors="pt")
        # Character length of the decoded prompt, used below to strip the
        # prompt from the generated text.
        prompt_length = len(self.gen_tokenizer.decode(inputs[0], skip_special_tokens=True, clean_up_tokenization_spaces=True))
        outputs = self.gen_model.generate(inputs, max_length=200, do_sample=True, top_p=0.95, top_k=60)
        generated = prompt + self.gen_tokenizer.decode(outputs[0])[prompt_length:]
        return generated
```
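A minimal usage sketch for the class above (assuming `fugashi` with a MeCab dictionary, `sentencepiece`, and `transformers` are installed; the prompt is the same example used in the class definition):

```python
# Illustrative usage of the XLNet helper class defined above.
xlnet = XLNet()
print(xlnet.generate("福岡のご飯は美味しい。コンパクトで暮らしやすい街。"))
```
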

#### Limitations and bias

This model was trained on Japanese business news, so its outputs reflect that domain.

# Important matter

This model was created and published by a company called Stockmark.

This repository exists to make the model available through Hugging Face, not to infringe on the original authors' rights.

See this document: https://qiita.com/mkt3/items/4d0ae36f3f212aee8002

Published by https://github.com/mkt3
|