---
language:
- ja
thumbnail:
tags:
- xlnet
- lm-head
- causal-lm
license:
- apache-2.0
datasets:
- Japanese_Business_News
metrics:
---
# XLNet-japanese
## Model description
This model requires MeCab and SentencePiece together with `XLNetTokenizer`.
For details, see https://qiita.com/mkt3/items/4d0ae36f3f212aee8002
This model uses NFKD as its Unicode normalization method. As a side effect, Japanese voiced sound marks (dakuten) and semi-voiced sound marks (handakuten) are lost.
*This model handles Japanese text without dakuten and handakuten.*
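To illustrate why the voicing marks disappear: NFKD decomposes a voiced kana into a base kana plus a combining mark, and the marks are then lost in this model's preprocessing. The following is a minimal sketch of that effect; the explicit stripping step is illustrative, not the model's exact preprocessing code:
```python
import unicodedata

# NFKD decomposes a voiced kana into base kana + combining mark, e.g.
# "ご" (U+3054) -> "こ" (U+3053) + combining dakuten (U+3099)
decomposed = unicodedata.normalize("NFKD", "ご飯")
print([hex(ord(ch)) for ch in decomposed])  # ['0x3053', '0x3099', '0x98ef']

# Dropping the combining marks loses the voicing information
# (illustrative assumption about where the marks are lost):
stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
print(stripped)  # こ飯 -- "ご" has become "こ"
```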
#### How to use
```python
from fugashi import Tagger
from transformers import XLNetLMHeadModel, XLNetTokenizer


class XLNet:
    def __init__(self):
        # MeCab tagger in wakati mode: segments Japanese text into
        # space-separated words, which the tokenizer expects as input.
        self.m = Tagger('-Owakati')
        self.gen_model = XLNetLMHeadModel.from_pretrained("hajime9652/xlnet-japanese")
        self.gen_tokenizer = XLNetTokenizer.from_pretrained("hajime9652/xlnet-japanese")

    def generate(self, prompt="福岡のご飯は美味しい。コンパクトで暮らしやすい街。"):
        # Pre-segment the prompt with MeCab before SentencePiece tokenization.
        prompt = self.m.parse(prompt)
        inputs = self.gen_tokenizer.encode(prompt, add_special_tokens=False, return_tensors="pt")
        # Character length of the decoded prompt, used below to slice
        # the prompt off the front of the generated text.
        prompt_length = len(self.gen_tokenizer.decode(inputs[0], skip_special_tokens=True, clean_up_tokenization_spaces=True))
        outputs = self.gen_model.generate(inputs, max_length=200, do_sample=True, top_p=0.95, top_k=60)
        # Keep the original prompt plus only the newly generated continuation.
        generated = prompt + self.gen_tokenizer.decode(outputs[0])[prompt_length:]
        return generated
```
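A minimal usage sketch, assuming the class above (the custom prompt is just an example, and output varies between runs because sampling is enabled):
```python
if __name__ == "__main__":
    xlnet = XLNet()
    # Continue the default prompt.
    print(xlnet.generate())
    # Or continue a custom prompt.
    print(xlnet.generate(prompt="明日の東京株式市場では"))
```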
#### Limitations and bias
This model was trained on Japanese business news, so its output reflects the topics, style, and biases of that domain.
# Important matter
This model was created and published by Stockmark Inc.
This repository exists to make the model available through HuggingFace and is not intended to infringe on Stockmark's rights.
See this document: https://qiita.com/mkt3/items/4d0ae36f3f212aee8002
Published by https://github.com/mkt3