Quora Sentence Similarity

This is a sentence-transformers model. It maps sentences and paragraphs to a 768-dimensional dense vector space and can be used for tasks such as clustering or semantic search.

For more information on how it was created, check out the following link: https://github.com/DunnBC22/NLP_Projects/blob/main/Semantic_Similarity/Semantic%20Similarity-large.ipynb

Usage (Sentence-Transformers)

Using this model is straightforward once sentence-transformers is installed:

pip install -U sentence-transformers

Then you can use the model like this:

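from sentence_transformers import SentenceTransformer

sentences = ["This is an example sentence", "Each sentence is converted"]

# Replace {MODEL_NAME} with this model's Hugging Face Hub ID or a local path.
model = SentenceTransformer('{MODEL_NAME}')

# encode() returns one embedding per input sentence.
embeddings = model.encode(sentences)
print(embeddings)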

Evaluation Results

| Metric | Measure | Value | Notes |
|---|---|---|---|
| Accuracy | Cosine-Similarity | 88.72 | Threshold: 0.8397 |
| F1 | Cosine-Similarity | 85.22 | Threshold: 0.8223 |
| Precision | Cosine-Similarity | 80.72 | - |
| Recall | Cosine-Similarity | 90.25 | - |
| Average Precision | Cosine-Similarity | 89.75 | - |
| Accuracy | Manhattan-Distance | 88.71 | Threshold: 12.4351 |
| F1 | Manhattan-Distance | 85.22 | Threshold: 13.2209 |
| Precision | Manhattan-Distance | 80.58 | - |
| Recall | Manhattan-Distance | 90.42 | - |
| Average Precision | Manhattan-Distance | 89.74 | - |
| Accuracy | Euclidean-Distance | 88.72 | Threshold: 0.5662 |
| F1 | Euclidean-Distance | 85.22 | Threshold: 0.5962 |
| Precision | Euclidean-Distance | 80.72 | - |
| Recall | Euclidean-Distance | 90.25 | - |
| Average Precision | Euclidean-Distance | 89.75 | - |
| Accuracy | Dot-Product | 88.72 | Threshold: 0.8397 |
| F1 | Dot-Product | 85.22 | Threshold: 0.8223 |
| Precision | Dot-Product | 80.72 | - |
| Recall | Dot-Product | 90.25 | - |
| Average Precision | Dot-Product | 89.75 | - |
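
The thresholds in the results can be used to turn a similarity score into a duplicate / not-duplicate decision. The following is a minimal sketch, not the evaluator's own code: `is_duplicate` is a hypothetical helper, the toy 3-dimensional vectors stand in for real 768-dimensional embeddings, and treating a score equal to the threshold as a duplicate is an assumption.

```python
import math

# Threshold from the accuracy row of the cosine-similarity results.
COSINE_ACCURACY_THRESHOLD = 0.8397


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def is_duplicate(emb_a, emb_b, threshold=COSINE_ACCURACY_THRESHOLD):
    """Hypothetical helper: label a pair as duplicate when its similarity reaches the threshold."""
    return cosine_similarity(emb_a, emb_b) >= threshold


# Toy vectors standing in for embeddings returned by model.encode().
close_pair = ([1.0, 0.0, 0.0], [0.9, 0.1, 0.0])
far_pair = ([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
print(is_duplicate(*close_pair))
print(is_duplicate(*far_pair))
```

Choosing the F1 threshold (0.8223) instead would accept slightly more pairs as duplicates.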

For an automated evaluation of this model, see the Sentence Embeddings Benchmark: https://seb.sbert.net

Training

The model was trained with the parameters:

DataLoader:

torch.utils.data.dataloader.DataLoader of length 5055 with parameters:

{'batch_size': 64, 'sampler': 'torch.utils.data.sampler.RandomSampler', 'batch_sampler': 'torch.utils.data.sampler.BatchSampler'}

Loss:

sentence_transformers.losses.OnlineContrastiveLoss.OnlineContrastiveLoss
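
To give a feel for what an online contrastive loss does, here is a simplified pure-Python sketch. It is not the library's implementation. It assumes cosine distance (1 minus cosine similarity) and a margin of 0.5, neither of which is stated in this model card, and `online_contrastive_loss` is a hypothetical name. The idea is that only the hard pairs in a batch contribute: duplicates that are farther apart than the closest non-duplicate, and non-duplicates that are closer than the farthest duplicate.

```python
def online_contrastive_loss(distances, labels, margin=0.5):
    """Simplified sketch of hard-pair selection in an online contrastive loss.

    distances: one distance per sentence pair in the batch (assumed cosine distance).
    labels: 1 for duplicate pairs, 0 for non-duplicate pairs.
    margin: assumed value, not taken from the model card.
    """
    pos = [d for d, y in zip(distances, labels) if y == 1]
    neg = [d for d, y in zip(distances, labels) if y == 0]

    # If one class is missing from the batch, every pair of the other class counts as hard.
    max_pos = max(pos) if pos else float("inf")
    min_neg = min(neg) if neg else float("-inf")

    hard_pos = [d for d in pos if d > min_neg]  # duplicates that are too far apart
    hard_neg = [d for d in neg if d < max_pos]  # non-duplicates that are too close

    positive_loss = sum(d ** 2 for d in hard_pos)
    negative_loss = sum(max(0.0, margin - d) ** 2 for d in hard_neg)
    return positive_loss + negative_loss


# Two duplicate pairs and two non-duplicate pairs with toy distances.
print(online_contrastive_loss([0.1, 0.6, 0.3, 0.9], [1, 1, 0, 0]))
```

In the toy batch, only the duplicate at distance 0.6 and the non-duplicate at distance 0.3 are hard; the other two pairs are already well separated and contribute nothing.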

Parameters of the fit()-Method:

{
    "epochs": 1,
    "evaluation_steps": 0,
    "evaluator": "sentence_transformers.evaluation.BinaryClassificationEvaluator.BinaryClassificationEvaluator",
    "max_grad_norm": 1,
    "optimizer_class": "<class 'torch.optim.adamw.AdamW'>",
    "optimizer_params": {
        "lr": 2e-05
    },
    "scheduler": "WarmupLinear",
    "steps_per_epoch": null,
    "warmup_steps": 20,
    "weight_decay": 0.01
}
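
The learning-rate behaviour implied by these parameters can be sketched as follows. This is an illustration under assumptions rather than the library's scheduler: it assumes that "WarmupLinear" ramps linearly from 0 to the base rate over the warmup steps and then decays linearly to 0, and that with "steps_per_epoch" left as null the DataLoader length (5055) is used as the number of steps per epoch. `warmup_linear_lr` is a hypothetical helper.

```python
BASE_LR = 2e-05          # "lr" in optimizer_params
WARMUP_STEPS = 20        # "warmup_steps"
TOTAL_STEPS = 5055 * 1   # DataLoader length x "epochs" (assumed total step count)


def warmup_linear_lr(step, base_lr=BASE_LR, warmup_steps=WARMUP_STEPS,
                     total_steps=TOTAL_STEPS):
    """Learning rate at a given optimizer step under linear warmup and linear decay."""
    if step < warmup_steps:
        # Warmup phase: scale up linearly from 0 to base_lr.
        return base_lr * step / max(1, warmup_steps)
    # Decay phase: scale down linearly from base_lr to 0 at total_steps.
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


for step in (0, 10, 20, 2500, 5055):
    print(step, warmup_linear_lr(step))
```

With only 20 warmup steps out of 5055, the warmup covers well under 1% of training, so the schedule is almost entirely a linear decay from 2e-05.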

Potential Improvements

This model was trained from the T5-Large checkpoint, so one way to improve its results is to start from a larger T5 checkpoint.

The T5 checkpoints are listed below (the asterisk marks the one used for this model):

| Checkpoint | Trainable Parameters |
|---|---|
| T5-Base | 220 Million |
| T5-Large | 770 Million* |
| T5-3B | 3 Billion |
| T5-11B | 11 Billion |

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 34, 'do_lower_case': False}) with Transformer model: T5EncoderModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
  (2): Dense({'in_features': 1024, 'out_features': 768, 'bias': False, 'activation_function': 'torch.nn.modules.linear.Identity'})
  (3): Normalize()
)
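
Because the final Normalize() layer produces unit-length embeddings, the dot product of two embeddings equals their cosine similarity, and the Euclidean distance follows from it as sqrt(2 - 2 * cos). This explains why the cosine and dot-product rows in the evaluation results are identical. The short sketch below (with a hypothetical helper name) converts the reported cosine thresholds into Euclidean ones, which agree with the reported 0.5662 and 0.5962 to four decimal places.

```python
import math


def euclidean_from_cosine(cos_sim):
    """Euclidean distance between two unit-length vectors with the given cosine similarity."""
    # For unit vectors: ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b = 2 - 2 * cos_sim
    return math.sqrt(max(0.0, 2.0 - 2.0 * cos_sim))


# Cosine thresholds reported for accuracy and F1.
print(round(euclidean_from_cosine(0.8397), 4))
print(round(euclidean_from_cosine(0.8223), 4))
```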

Citing & Authors

Dataset Source: https://www.kaggle.com/datasets/quora/question-pairs-dataset
