Update README.md

README.md
A) Directly typing `[MASK]` in an input string and B) replacing a token with `[MASK]` after tokenization will yield different token sequences, and thus different prediction results. It is more appropriate to use `[MASK]` after tokenization, as that is consistent with how the model was pretrained. However, the Huggingface Inference API only supports typing `[MASK]` in the input string, and it therefore produces less robust predictions.
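For instance, here is a minimal sketch of the difference (the tokenizer loading follows the model card, while the sample sentence and the replaced index are only illustrative):

~~~~
from transformers import T5Tokenizer

# load the tokenizer shipped with rinna/japanese-roberta-base
tokenizer = T5Tokenizer.from_pretrained("rinna/japanese-roberta-base")
tokenizer.do_lower_case = True

# A) typing [MASK] directly: subword segmentation runs over the raw string,
# so the text surrounding "[MASK]" is segmented differently
tokens_a = tokenizer.tokenize("4年に1度[MASK]は開催される。")

# B) tokenizing the complete sentence first, then substituting the mask token
tokens_b = tokenizer.tokenize("4年に1度オリンピックは開催される。")
tokens_b[4] = tokenizer.mask_token  # index 4 (オリンピック) is illustrative

# the two token sequences generally differ, and so do the resulting predictions
print(tokens_a)
print(tokens_b)
~~~~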
## Note 3: Provide `position_ids` as an argument explicitly

When `position_ids` are not provided for a `Roberta*` model, Huggingface's `transformers` will automatically construct them, but they start from `padding_idx` instead of `0` (see this [issue](https://github.com/rinnakk/japanese-pretrained-models/issues/3) and the function `create_position_ids_from_input_ids()` in Huggingface's [implementation](https://github.com/huggingface/transformers/blob/master/src/transformers/models/roberta/modeling_roberta.py)). Unfortunately, this does not work as expected with `rinna/japanese-roberta-base`, since the `padding_idx` of the corresponding tokenizer is not `0`. So please be sure to construct the `position_ids` yourself and make them start from position id `0`.
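To see the effect, here is a minimal sketch paraphrasing the linked `create_position_ids_from_input_ids()` helper (the `padding_idx` value below is only illustrative):

~~~~
import torch

def create_position_ids_from_input_ids(input_ids, padding_idx):
    # non-padding positions are numbered incrementally starting from
    # padding_idx + 1 rather than from 0
    mask = input_ids.ne(padding_idx).int()
    return (torch.cumsum(mask, dim=1) * mask).long() + padding_idx

input_ids = torch.LongTensor([[4, 1602, 44, 24, 368, 6, 11, 21583, 8]])
print(create_position_ids_from_input_ids(input_ids, padding_idx=3))
# tensor([[ 4,  5,  6,  7,  8,  9, 10, 11, 12]]) -- shifted away from 0
~~~~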
## Example

Here is an example to illustrate how our model works as a masked language model. Notice the difference between running the following code example and running the Huggingface Inference API.
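The excerpt below picks up right after tokenization, so it assumes `tokenizer`, `model`, `token_ids`, and `masked_idx` are already defined. A minimal setup could look like the following (the sample sentence and `masked_idx` are reconstructed from the printed outputs, so treat them as illustrative):

~~~~
from transformers import T5Tokenizer, RobertaForMaskedLM

# load tokenizer and model
tokenizer = T5Tokenizer.from_pretrained("rinna/japanese-roberta-base")
tokenizer.do_lower_case = True
model = RobertaForMaskedLM.from_pretrained("rinna/japanese-roberta-base")

# prepend [CLS], tokenize, and mask one token (see the notes above)
text = "[CLS]4年に1度オリンピックは開催される。"
tokens = tokenizer.tokenize(text)
masked_idx = 5  # illustrative: the position of the token to mask
tokens[masked_idx] = tokenizer.mask_token

# convert tokens to ids
token_ids = tokenizer.convert_tokens_to_ids(tokens)
~~~~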
~~~~
print(token_ids)  # output: [4, 1602, 44, 24, 368, 6, 11, 21583, 8]

# convert to tensor
import torch
token_tensor = torch.LongTensor([token_ids])

# provide position ids explicitly
position_ids = list(range(0, token_tensor.size(1)))
print(position_ids)  # output: [0, 1, 2, 3, 4, 5, 6, 7, 8]
position_id_tensor = torch.LongTensor([position_ids])

# get the top 10 predictions of the masked token
with torch.no_grad():
    outputs = model(input_ids=token_tensor, position_ids=position_id_tensor)
    predictions = outputs[0][0, masked_idx].topk(10)

for i, index_t in enumerate(predictions.indices):
    index = index_t.item()
    token = tokenizer.convert_ids_to_tokens([index])[0]
    print(i, token)

"""
0 総会
1 サミット
2 ワールドカップ
3 フェスティバル
4 大会
5 オリンピック
6 全国大会
7 党大会
8 イベント
9 世界選手権
"""
~~~~
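For comparison, a sketch of what happens when `position_ids` is omitted, so that `transformers` constructs it internally as described in Note 3:

~~~~
# without explicit position ids, positions start from padding_idx + 1
# rather than 0, so the predictions become less reliable
with torch.no_grad():
    outputs = model(input_ids=token_tensor)
    predictions = outputs[0][0, masked_idx].topk(10)
# the top-10 tokens decoded from these predictions will generally differ
# from the list above
~~~~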