language:
  - ja
library_name: sentence-transformers
tags:
  - sentence-transformers
  - sentence-similarity
  - feature-extraction
metrics:
  - pearson_cosine
  - spearman_cosine
  - pearson_manhattan
  - spearman_manhattan
  - pearson_euclidean
  - spearman_euclidean
  - pearson_dot
  - spearman_dot
  - pearson_max
  - spearman_max
widget: []
pipeline_tag: sentence-similarity
datasets:
  - hpprc/emb
  - hpprc/mqa-ja
  - google-research-datasets/paws-x
base_model: pkshatech/GLuCoSE-base-ja
license: apache-2.0

GLuCoSE v2

This is a general-purpose Japanese text embedding model that excels at retrieval tasks. It can run on CPU and is designed both to measure semantic similarity between sentences and to serve as a retrieval system that searches passages based on queries.

Key features:

  • Specialized for retrieval tasks: it achieves the highest performance among similarly sized models on MIRACL and other tasks.
  • Optimized for Japanese text processing
  • Can run on CPU

During inference, the prefix "query: " or "passage: " is required. Please check the Usage section for details.
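
As a minimal sketch of this prefix convention, the helpers below prepend the required prefix to raw text unless it is already present. The names `as_query` and `as_passage` are hypothetical and are not part of the model or of any library API.

```python
# Hypothetical helpers (not part of the model or sentence-transformers API)
# that add the prefixes this model expects at inference time.
QUERY_PREFIX = "query: "
PASSAGE_PREFIX = "passage: "


def as_query(text: str) -> str:
    """Prefix a search query (the same prefix is used for non-retrieval tasks)."""
    return text if text.startswith(QUERY_PREFIX) else QUERY_PREFIX + text


def as_passage(text: str) -> str:
    """Prefix a document or passage that should be searchable."""
    return text if text.startswith(PASSAGE_PREFIX) else PASSAGE_PREFIX + text


queries = [as_query("PKSHAはどんな会社ですか?")]
passages = [as_passage("研究開発したアルゴリズムを、多くの企業のソフトウエア・オペレーションに導入しています。")]
print(queries + passages)
```

The resulting strings can be passed directly to the encoding code in the Usage section.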

Model Description

The model is based on GLuCoSE and fine-tuned through distillation using several large-scale embedding models and multi-stage contrastive learning.

  • Maximum Sequence Length: 512 tokens
  • Output Dimensionality: 768 dimensions
  • Similarity Function: Cosine Similarity
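
For reference, the cosine similarity of two embedding vectors is their dot product divided by the product of their lengths. The following is a minimal pure-Python sketch that uses toy 3-dimensional vectors in place of real 768-dimensional model outputs.

```python
import math


def cosine_similarity(u: list[float], v: list[float]) -> float:
    """Return dot(u, v) / (|u| * |v|) for two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy vectors for illustration only; real embeddings from this model have 768 dimensions.
print(cosine_similarity([1.0, 0.0, 0.0], [1.0, 0.0, 0.0]))  # same direction: 1.0
print(cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]))  # orthogonal: 0.0
```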

Usage

Direct Usage (Sentence Transformers)

You can perform inference using SentenceTransformer with the following code:

from sentence_transformers import SentenceTransformer
import torch.nn.functional as F

# Download from the 🤗 Hub
model = SentenceTransformer("pkshatech/GLuCoSE-base-ja-v2")

# Each input text should start with "query: " or "passage: ".
# For tasks other than retrieval, you can simply use the "query: " prefix.
sentences = [
    'query: PKSHAはどんな会社ですか?',
    'passage: 研究開発したアルゴリズムを、多くの企業のソフトウエア・オペレーションに導入しています。',
    'query: 日本で一番高い山は?',
    'passage: 富士山(ふじさん)は、標高3776.12 m、日本最高峰(剣ヶ峰)の独立峰で、その優美な風貌は日本国外でも日本の象徴として広く知られている。',
]
embeddings = model.encode(sentences, convert_to_tensor=True)
print(embeddings.shape)
# torch.Size([4, 768])

# Get the similarity scores for the embeddings
similarities = F.cosine_similarity(embeddings.unsqueeze(0), embeddings.unsqueeze(1), dim=2)
print(similarities)
# [[1.0000, 0.6050, 0.4341, 0.5537],
# [0.6050, 1.0000, 0.5018, 0.6815],
# [0.4341, 0.5018, 1.0000, 0.7534],
# [0.5537, 0.6815, 0.7534, 1.0000]]
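
For retrieval, only the query-to-passage entries of this matrix matter. As a sketch, the snippet below reuses the similarity values printed above (rows 0 and 2 are queries, columns 1 and 3 are passages) and picks the best passage for each query. The variable names are illustrative.

```python
# Similarity matrix copied from the example output above.
similarities = [
    [1.0000, 0.6050, 0.4341, 0.5537],
    [0.6050, 1.0000, 0.5018, 0.6815],
    [0.4341, 0.5018, 1.0000, 0.7534],
    [0.5537, 0.6815, 0.7534, 1.0000],
]
query_rows = [0, 2]    # indices of the "query: " sentences
passage_cols = [1, 3]  # indices of the "passage: " sentences

best_passage = {}
for q in query_rows:
    # Keep the passage index with the highest cosine similarity to query q.
    best_passage[q] = max(passage_cols, key=lambda p: similarities[q][p])

print(best_passage)
# {0: 1, 2: 3}
```

Each query is matched to its intended passage: the PKSHA question to the PKSHA description, and the mountain question to the Mt. Fuji passage.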

Direct Usage (Transformers)

You can perform inference using Transformers with the following code:

import torch.nn.functional as F
from torch import Tensor
from transformers import AutoTokenizer, AutoModel

def mean_pooling(last_hidden_states: Tensor, attention_mask: Tensor) -> Tensor:
    emb = last_hidden_states * attention_mask.unsqueeze(-1)
    emb = emb.sum(dim=1) / attention_mask.sum(dim=1).unsqueeze(-1)
    return emb

# Download from the 🤗 Hub
tokenizer = AutoTokenizer.from_pretrained("pkshatech/GLuCoSE-base-ja-v2")
model = AutoModel.from_pretrained("pkshatech/GLuCoSE-base-ja-v2")

# Each input text should start with "query: " or "passage: ".
# For tasks other than retrieval, you can simply use the "query: " prefix.
sentences = [
    'query: PKSHAはどんな会社ですか?',
    'passage: 研究開発したアルゴリズムを、多くの企業のソフトウエア・オペレーションに導入しています。',
    'query: 日本で一番高い山は?',
    'passage: 富士山(ふじさん)は、標高3776.12 m、日本最高峰(剣ヶ峰)の独立峰で、その優美な風貌は日本国外でも日本の象徴として広く知られている。',
]

# Tokenize the input texts
batch_dict = tokenizer(sentences, max_length=512, padding=True, truncation=True, return_tensors='pt')

outputs = model(**batch_dict)
embeddings = mean_pooling(outputs.last_hidden_state, batch_dict['attention_mask'])
print(embeddings.shape)
# torch.Size([4, 768])

# Get the similarity scores for the embeddings
similarities = F.cosine_similarity(embeddings.unsqueeze(0), embeddings.unsqueeze(1), dim=2)
print(similarities)
# [[1.0000, 0.6050, 0.4341, 0.5537],
# [0.6050, 1.0000, 0.5018, 0.6815],
# [0.4341, 0.5018, 1.0000, 0.7534],
# [0.5537, 0.6815, 0.7534, 1.0000]]
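
The `mean_pooling` function above averages token vectors while ignoring padding positions. The following pure-Python illustration uses made-up numbers (one sequence, three token positions, the last of which is padding) to show the effect of the attention mask.

```python
# Toy data for illustration only: 3 token positions, 2 hidden dimensions.
token_vectors = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]  # last row is a padding position
attention_mask = [1, 1, 0]                                 # 0 marks padding

hidden_size = len(token_vectors[0])
summed = [0.0] * hidden_size
for vector, keep in zip(token_vectors, attention_mask):
    for i, value in enumerate(vector):
        summed[i] += value * keep  # padding rows contribute nothing

num_real_tokens = sum(attention_mask)
pooled = [s / num_real_tokens for s in summed]
print(pooled)
# [2.0, 3.0]
```

The large values in the padding row do not affect the result, which is the average of the two real token vectors only.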

Training Details

The fine-tuning of GLuCoSE v2 is carried out through the following steps:

Step 1: Ensemble distillation

Step 2: Contrastive learning

  • Triplets were created from JSNLI, MNLI, PAWS-X, JSeM and Mr. TyDi and used for training.
  • This training aimed to improve the overall performance as a sentence embedding model.
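
The card does not state which loss function was used. Purely as an illustration of contrastive learning on (anchor, positive, negative) triplets, the sketch below computes an InfoNCE-style loss for a single triplet from precomputed cosine similarities. The function name and the temperature value are assumptions, not details of the actual training.

```python
import math


def triplet_infonce_loss(sim_pos: float, sim_neg: float, temperature: float = 0.05) -> float:
    """InfoNCE-style loss for one (anchor, positive, negative) triplet.

    sim_pos and sim_neg are the cosine similarities of the anchor to the
    positive and to the negative. The temperature of 0.05 is an assumed,
    illustrative value.
    """
    pos = math.exp(sim_pos / temperature)
    neg = math.exp(sim_neg / temperature)
    # Negative log-probability of picking the positive over the negative.
    return -math.log(pos / (pos + neg))


# The loss shrinks as the positive becomes more similar than the negative.
print(triplet_infonce_loss(sim_pos=0.9, sim_neg=0.1))
print(triplet_infonce_loss(sim_pos=0.5, sim_neg=0.5))  # equals log(2) when both are equal
```

Minimizing a loss of this kind pulls the anchor toward its positive and pushes it away from its negative in embedding space.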

Step 3: Search-specific contrastive learning

Benchmarks

Retrieval

Evaluated with MIRACL-ja, JQaRA, JaCWIR and MLDR-ja.

| Model | Size | MIRACL Recall@5 | JQaRA nDCG@10 | JaCWIR MAP@10 | MLDR nDCG@10 |
|---|---|---|---|---|---|
| intfloat/multilingual-e5-large | 0.6B | 89.2 | 55.4 | 87.6 | 29.8 |
| cl-nagoya/ruri-large | 0.3B | 78.7 | 62.4 | 85.0 | 37.5 |
| intfloat/multilingual-e5-base | 0.3B | 84.2 | 47.2 | 85.3 | 25.4 |
| cl-nagoya/ruri-base | 0.1B | 74.3 | 58.1 | 84.6 | 35.3 |
| pkshatech/GLuCoSE-base-ja | 0.1B | 53.3 | 30.8 | 68.6 | 25.2 |
| GLuCoSE v2 | 0.1B | 85.5 | 60.6 | 85.3 | 33.8 |

Note: Results for OpenAI small embeddings on JQaRA and JaCWIR are quoted from JQaRA and JaCWIR, respectively.
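
As a rough guide to the metrics named in the table, the sketch below computes Recall@k and nDCG@k for a single query with binary relevance labels. Exact definitions differ slightly between benchmarks, so this is an illustration rather than the evaluation code used for these results. The function names and the toy data are hypothetical.

```python
import math


def recall_at_k(ranked_ids: list[str], relevant_ids: set[str], k: int) -> float:
    """Fraction of relevant documents that appear in the top-k results."""
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / len(relevant_ids)


def ndcg_at_k(ranked_ids: list[str], relevant_ids: set[str], k: int) -> float:
    """nDCG@k with binary gains: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 2)  # rank is 0-based, so position 1 is discounted by log2(2)
        for rank, doc_id in enumerate(ranked_ids[:k])
        if doc_id in relevant_ids
    )
    ideal_hits = min(len(relevant_ids), k)
    ideal_dcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / ideal_dcg


# Toy ranking: two relevant documents, one retrieved at position 1, one missing.
ranked = ["d1", "d7", "d3", "d9", "d4"]
relevant = {"d1", "d5"}
print(recall_at_k(ranked, relevant, k=5))  # 0.5
print(ndcg_at_k(ranked, relevant, k=5))
```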

JMTEB

Evaluated with JMTEB. The average score is a macro-average.

| Model | Size | Avg. | Retrieval | STS | Classification | Reranking | Clustering | PairClassification |
|---|---|---|---|---|---|---|---|---|
| OpenAI/text-embedding-3-small | - | 69.18 | 66.39 | 79.46 | 73.06 | 92.92 | 51.06 | 62.27 |
| OpenAI/text-embedding-3-large | - | 74.05 | 74.48 | 82.52 | 77.58 | 93.58 | 53.32 | 62.35 |
| intfloat/multilingual-e5-large | 0.6B | 70.90 | 70.98 | 79.70 | 72.89 | 92.96 | 51.24 | 62.15 |
| cl-nagoya/ruri-large | 0.3B | 73.31 | 73.02 | 83.13 | 77.43 | 92.99 | 51.82 | 62.29 |
| intfloat/multilingual-e5-base | 0.3B | 68.61 | 68.21 | 79.84 | 69.30 | 92.85 | 48.26 | 62.26 |
| cl-nagoya/ruri-base | 0.1B | 71.91 | 69.82 | 82.87 | 75.58 | 92.91 | 54.16 | 62.38 |
| pkshatech/GLuCoSE-base-ja | 0.1B | 67.29 | 59.02 | 78.71 | 76.82 | 91.90 | 49.78 | 66.39 |
| GLuCoSE v2 | 0.1B | 72.23 | 73.36 | 82.96 | 74.21 | 93.01 | 48.65 | 62.37 |

Note: Results for OpenAI embeddings and multilingual-e5 models are quoted from the JMTEB leaderboard. Results for ruri are quoted from the cl-nagoya/ruri-base model card.

Authors

Chihiro Yano, Mocho Go, Hideyuki Tachibana, Hiroto Takegawa, Yotaro Watanabe

License

This model is published under the Apache License, Version 2.0.