Token or Sentence Embeddings
In the past few weeks, I've been messing around with SBERT models.
Is it possible to create embeddings with BLOOM to perform semantic search?
If you download the model and run it yourself, yes, but it's not provided through the model widget API.
Hey! I created an embedding model of BLOOM 1B3 here that may be of interest to you: https://huggingface.co/bigscience-catalogue-lm-data/sgpt-nli-bloom-1b3
If it is of interest, I can create a similar fine-tuned embedding model for this 176B model, but do note that the embeddings would be very expensive to retrieve. Further, storing them would require a lot of space if we don't add a linear layer to reduce their dimensionality.
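As a sketch of that dimensionality-reduction idea (the checkpoint and output size below are placeholders, not the actual 176B setup), sentence-transformers lets you append a Dense module after the pooling layer:
# Sketch: reduce the pooled embedding size with a Dense projection so the vectors are cheaper to store
from torch import nn
from sentence_transformers import SentenceTransformer, models

word_embedding_model = models.Transformer('bigscience/bloom-1b3')  # placeholder checkpoint
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension())
dense_model = models.Dense(in_features=pooling_model.get_sentence_embedding_dimension(), out_features=256, activation_function=nn.Tanh())
reduced_model = SentenceTransformer(modules=[word_embedding_model, pooling_model, dense_model])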
Otherwise, you can of course load the model in HF with AutoModel and produce embeddings, but they will not perform well without fine-tuning as was done for the model above.
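For reference, a rough sketch of that AutoModel route; the checkpoint is a placeholder and plain mean pooling is used only for illustration (it is not the weighted-mean pooling the SGPT model was trained with):
# Sketch: un-fine-tuned embeddings from BLOOM via AutoModel with masked mean pooling
import torch
from transformers import AutoModel, AutoTokenizer

model_name = 'bigscience/bloom-1b3'  # placeholder; swap in the checkpoint you actually use
tokenizer = AutoTokenizer.from_pretrained(model_name)
bloom = AutoModel.from_pretrained(model_name)

sentences = ['How do I search documents semantically?', 'An example query for semantic search']
batch = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
with torch.no_grad():
    last_hidden = bloom(**batch).last_hidden_state  # (batch, seq_len, hidden)

# Zero out padding positions, then average over the sequence dimension
mask = batch['attention_mask'].unsqueeze(-1).float()
embeddings = (last_hidden * mask).sum(dim=1) / mask.sum(dim=1)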
Thank you for the feedback!
Versions:
- transformers: 4.20.1
- sentence-transformers: 2.2.2
I could not load the model using:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("bigscience-catalogue-lm-data/sgpt-nli-bloom-1b3")
but it ended up working when I created the SentenceTransformer model from the word-embedding model:
from sentence_transformers import SentenceTransformer, models, evaluation
word_embedding_model = models.Transformer('bigscience-catalogue-lm-data/sgpt-nli-bloom-1b3', max_seq_length=256)
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension())
model2 = SentenceTransformer(modules=[word_embedding_model, pooling_model])
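As a quick sanity check (the sentences are just examples), encoding and similarity seem to work:
# Sanity check on the workaround above
from sentence_transformers import util

embeddings = model2.encode(['A man is eating food.', 'A man is eating a piece of bread.'])
print(embeddings.shape)  # (2, embedding_dim)
print(util.cos_sim(embeddings[0], embeddings[1]))  # pairwise cosine similarity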
I was trying to fine-tune it like I did previously with sentence-transformers, and it did not work.
# Load the dataset
from datasets import load_dataset
dataset = load_dataset("assin")
from sentence_transformers import SentenceTransformer, InputExample, losses, models, evaluation
from torch.utils.data import DataLoader
# Build training examples; ASSIN relatedness scores go up to 5, so divide by 5 to map the label into [0, 1]
train_examples = [InputExample(texts=[texts['premise'], texts['hypothesis']], label=texts['relatedness_score'] / 5) for texts in dataset['train']]
# Define the train dataset, the dataloader and the train loss
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=32)
train_loss = losses.CosineSimilarityLoss(model2)  # use model2, the model built above
# Tune the model
model2.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=int(0.1 * len(train_dataloader)))
Do I need a specific version? The first example should have loaded the model correctly, right?
Am I fine-tuning it the wrong way with this type of model?
Thanks in advance
To load it you need to either use this branch of sentence-transformers: https://github.com/UKPLab/sentence-transformers/pull/1613 or install sentence-transformers from this repo: https://github.com/Muennighoff/sgpt/tree/main/biencoder/nli_msmarco
Loading it via one of those two routes should solve your problem. Note, though, that the model is already fine-tuned. If you want to fine-tune it further, you probably also need to use one of the two repos above, as the model's pooling is not implemented in sentence-transformers. If you have a lot of data for your use case, it might make more sense to just fine-tune from scratch again.
Hope this helps 🤗
Thanks, I believe it works. I just need to have a GPU available to train and complete the fine-tuning.
I just have one other question.
With SBERT models, we can perform "domain adaptation" of the BERT model before creating an SBERT one.
It would allow the model to become more familiar with our specific context.
On my side, I only have access to raw texts.
I was wondering if I could perform something similar to this model, with relatively scarce computational resources.
Also, in order to fine-tune such a model, what GPU size do I need?
If you don't have aligned texts, i.e. just raw text, you probably want to fine-tune the BERT model, not the SentenceTransformer? If your texts are aligned, then for the model above you can probably fine-tune it with 1x A100 40GB or even 1x V100 32GB. If you have more GPUs, training will be faster via data parallelism. I'd recommend using GradCache to get a larger batch size, as implemented in this repo: https://github.com/Muennighoff/sgpt/tree/main/biencoder/nli_msmarco
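If you do go the raw-text route, continued masked-language-model pretraining is one common way to domain-adapt BERT; here is a rough sketch (the checkpoint, file path and hyperparameters are placeholders, not a recommendation):
# Sketch: domain-adapt a BERT checkpoint on raw text with masked language modeling
from datasets import load_dataset
from transformers import AutoModelForMaskedLM, AutoTokenizer, DataCollatorForLanguageModeling, Trainer, TrainingArguments

model_name = 'bert-base-multilingual-cased'  # assumed base model
tokenizer = AutoTokenizer.from_pretrained(model_name)
mlm_model = AutoModelForMaskedLM.from_pretrained(model_name)

# One raw sentence/paragraph per line in a plain-text file
raw = load_dataset('text', data_files={'train': 'domain_corpus.txt'})
tokenized = raw.map(lambda x: tokenizer(x['text'], truncation=True, max_length=256), batched=True, remove_columns=['text'])

collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)
args = TrainingArguments(output_dir='bert-domain-adapted', per_device_train_batch_size=16, num_train_epochs=1)
Trainer(model=mlm_model, args=args, train_dataset=tokenized['train'], data_collator=collator).train()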
Yes, it would be the BERT in that case.
Thanks!
Sure, if you want to fine-tune BERT, you can do that easily in sentence-transformers with mean pooling. For fine-tuning BLOOM or other auto-regressive models for embeddings, it may be better to use the repo I linked above, which uses weighted-mean pooling (https://github.com/Muennighoff/sgpt/tree/main/biencoder/nli_msmarco).
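For the BERT route, a minimal sketch of wrapping a (domain-adapted) checkpoint with mean pooling in sentence-transformers; the model path is assumed:
# Sketch: build an SBERT-style model with mean pooling from a domain-adapted BERT checkpoint
from sentence_transformers import SentenceTransformer, models

word_embedding_model = models.Transformer('bert-domain-adapted', max_seq_length=256)  # assumed checkpoint path
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension(), pooling_mode='mean')
sbert_model = SentenceTransformer(modules=[word_embedding_model, pooling_model])
# From here, fine-tuning works like the ASSIN example earlier in the thread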
Closing this issue, feel free to reopen if you still have questions 👻