Model Card for KartonBERT-USE-base-v1

This universal sentence encoder maps text into a 768-dimensional vector space, producing representations well suited to sentence and document similarity tasks.

Despite its small size (only 104 million parameters), the model maintains a high level of performance. It uses a lowercase-optimized tokenizer with a vocabulary of 23,000 tokens. This balance of compactness and effectiveness lets the model deliver strong results in text-encoding tasks, combining speed with accuracy in real-time applications.
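If you want to check these properties locally, here is a minimal sketch (the expected values in the comments are taken from the description above, not from an independent run):

from transformers import AutoConfig, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('OrlikB/KartonBERT-USE-base-v1')
config = AutoConfig.from_pretrained('OrlikB/KartonBERT-USE-base-v1')

# Vocabulary size should match the 23,000 tokens stated above
print(tokenizer.vocab_size)

# Hidden size should match the 768-dimensional embedding space stated above
print(config.hidden_size)

# A lowercase-optimized tokenizer should tokenize cased and uncased text identically
print(tokenizer.tokenize('Kartony') == tokenizer.tokenize('kartony'))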

How to Get Started with the Model

Use the code below to get started with the model.

Using Sentence-Transformers

You can use the model with sentence-transformers:

pip install -U sentence-transformers
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('OrlikB/KartonBERT-USE-base-v1')

text_1 = 'Jestem wielkim fanem opakowań tekturowych'
text_2 = 'Bardzo podobają mi się kartony'

# With normalized embeddings, the dot product equals cosine similarity
embeddings_1 = model.encode(text_1, normalize_embeddings=True)
embeddings_2 = model.encode(text_2, normalize_embeddings=True)

similarity = embeddings_1 @ embeddings_2.T
print(similarity)
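For multiple texts, it is more efficient to encode them in a single call and compare all pairs at once. A small usage sketch (the third sentence and the batch_size value are illustrative additions, not from the original examples):

sentences = [
    'Jestem wielkim fanem opakowań tekturowych',
    'Bardzo podobają mi się kartony',
    'Wczoraj cały dzień padał deszcz',
]

# encode() accepts a list and returns an array of shape (len(sentences), 768)
embeddings = model.encode(sentences, normalize_embeddings=True, batch_size=32)

# Pairwise cosine similarities: dot products of L2-normalized vectors
similarity_matrix = embeddings @ embeddings.T
print(similarity_matrix)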

Using HuggingFace Transformers

from transformers import AutoTokenizer, AutoModel
import torch
import numpy as np

tokenizer = AutoTokenizer.from_pretrained('OrlikB/KartonBERT-USE-base-v1')
model = AutoModel.from_pretrained('OrlikB/KartonBERT-USE-base-v1')
model.eval()


def encode_text(text):
    encoded_input = tokenizer(text, padding=True, truncation=True, return_tensors='pt', max_length=512)
    with torch.no_grad():
        model_output = model(**encoded_input)
        # Use the [CLS] token embedding as the sentence representation
        sentence_embeddings = model_output[0][:, 0]
        # L2-normalize so that dot products equal cosine similarities
        sentence_embeddings = torch.nn.functional.normalize(sentence_embeddings, p=2, dim=1)
    return sentence_embeddings.squeeze().numpy()


def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))


text_1 = 'Jestem wielkim fanem opakowań tekturowych'
text_2 = 'Bardzo podobają mi się kartony'

embeddings_1 = encode_text(text_1)
embeddings_2 = encode_text(text_2)

print(cosine_similarity(embeddings_1, embeddings_2))

Note: the encode_text function above is meant for demonstration. For best performance, process texts in batches rather than one at a time.
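A minimal sketch of batched encoding built from the same components (encode_batch is a hypothetical helper, and the batch_size of 32 is an arbitrary illustrative choice):

def encode_batch(texts, batch_size=32):
    # Encode a list of texts in fixed-size chunks to bound memory use
    all_embeddings = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]
        encoded_input = tokenizer(batch, padding=True, truncation=True,
                                  return_tensors='pt', max_length=512)
        with torch.no_grad():
            model_output = model(**encoded_input)
            # Same [CLS] pooling and L2 normalization as encode_text above
            batch_embeddings = torch.nn.functional.normalize(model_output[0][:, 0], p=2, dim=1)
        all_embeddings.append(batch_embeddings)
    return torch.cat(all_embeddings).numpy()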

Evaluation

MTEB for Polish Language

| Rank | Model | Model Size (Million Parameters) | Memory Usage (GB, fp32) | Embedding Dimensions | Max Tokens | Average (26 datasets) | Classification Average (7 datasets) | Clustering Average (1 dataset) | PairClassification Average (4 datasets) | Retrieval Average (11 datasets) | STS Average (3 datasets) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | bge-multilingual-gemma2 | 9242 | 34.43 | 3584 | 8192 | 70.00 | 77.99 | 50.29 | 89.62 | 59.41 | 70.64 |
| 2 | gte-Qwen2-7B-instruct | 7613 | 28.36 | 3584 | 131072 | 67.86 | 77.84 | 51.36 | 88.48 | 54.69 | 70.86 |
| 3 | gte-Qwen2-1.5B-instruct | 1776 | 6.62 | 1536 | 131072 | 64.04 | 72.29 | 44.59 | 84.87 | 51.88 | 68.12 |
| 4 | jina-embeddings-v3 | 572 | 2.13 | 1024 | 8194 | 63.97 | 70.81 | 43.66 | 83.70 | 51.89 | 72.77 |
| 5 | mmlw-roberta-large | 435 | 1.62 | 1024 | 514 | 63.23 | 66.39 | 31.16 | 89.13 | 52.71 | 70.59 |
| 6 | KartonBERT-USE-base-v1 | 104 | 0.39 | 768 | 512 | 61.67 | 67.57 | 29.88 | 87.04 | 49.14 | 70.65 |
| 7 | mmlw-e5-large | 560 | 2.09 | 1024 | 514 | 61.17 | 61.07 | 30.62 | 85.90 | 52.63 | 69.98 |
| 8 | mmlw-roberta-base | 124 | 0.46 | 768 | 514 | 61.05 | 62.92 | 33.08 | 88.14 | 49.92 | 70.70 |
| 9 | multilingual-e5-large | 560 | 2.09 | 1024 | 514 | 60.08 | 63.82 | 33.88 | 85.50 | 48.98 | 66.91 |
| 10 | mmlw-e5-base | 278 | 1.04 | 768 | 514 | 59.71 | 59.52 | 30.25 | 86.16 | 50.06 | 70.13 |
| 11 | gte-multilingual-base | 305 | 1.14 | 768 | 8192 | 58.22 | 60.15 | 33.67 | 85.45 | 46.40 | 68.92 |
| 12 | st-polish-kartonberta-base-alpha-v1 | 124 | 0.46 | 768 | 514 | 56.92 | 60.44 | 32.85 | 87.92 | 42.19 | 69.47 |

More Information

If I have spare computing resources (GPU), I may further train the model to improve its quality.
