embeddings and need for fine-tuning

#97
by falloutats - opened

Hello everyone,

I'm working on a project where I have an Excel file containing 60,000 descriptions, each assigned one of 30 unique codes. My goal is to develop a system that can take new descriptions from users and assign the most relevant code based on the existing dataset.

I want to fine-tune the BGE M3-Embedding model using PyTorch for this classification task. However, I'm having trouble finding specific examples or guidance on how to do this.

Here are the details:

- **Dataset:** 60,000 text descriptions, each labeled with one of 30 unique codes.
- **Objective:** Fine-tune the BGE M3-Embedding model to classify new descriptions into one of the 30 codes.
- **Challenge:** Lack of specific examples or tutorials on fine-tuning this particular model with PyTorch.
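For context, this is roughly the architecture I have in mind — a minimal sketch, not working code. The class name `EmbeddingClassifier` is my own, and I'm assuming BGE-M3's hidden size is 1024 (it is based on XLM-RoBERTa-large):

```python
import torch
import torch.nn as nn

class EmbeddingClassifier(nn.Module):
    """Wraps a Hugging Face encoder and adds a linear classification head."""
    def __init__(self, encoder, hidden_size, num_labels):
        super().__init__()
        self.encoder = encoder
        self.dropout = nn.Dropout(0.1)
        self.classifier = nn.Linear(hidden_size, num_labels)

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        cls = out.last_hidden_state[:, 0]  # embedding of the [CLS] token
        return self.classifier(self.dropout(cls))

# Intended usage (downloads the pretrained weights):
# from transformers import AutoModel
# encoder = AutoModel.from_pretrained("BAAI/bge-m3")
# model = EmbeddingClassifier(encoder, hidden_size=1024, num_labels=30)
```

The idea is that the linear head produces one logit per code, so the whole thing can be trained end-to-end with cross-entropy loss. Is this the right way to structure it?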
My questions are:

1. How can I fine-tune the BGE M3-Embedding model with PyTorch for multi-class classification? Specifically, how do I modify the model to include a classification head?
2. What is the best way to prepare my dataset for training (tokenization, label encoding, handling imbalanced classes, etc.)?
3. Are there any code examples or resources that demonstrate this process? Step-by-step guides or tutorials would be very helpful.
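To make question 2 concrete, here is the kind of preparation I have in mind — a sketch with helper names of my own (`encode_labels`, `class_weights`), plus inverse-frequency weights as one common way to handle imbalance:

```python
import torch
from collections import Counter

def encode_labels(codes):
    """Map each unique code string to an integer id (sorted for determinism)."""
    label2id = {c: i for i, c in enumerate(sorted(set(codes)))}
    return [label2id[c] for c in codes], label2id

def class_weights(label_ids, num_labels):
    """Inverse-frequency weights, usable as nn.CrossEntropyLoss(weight=...)."""
    counts = Counter(label_ids)
    total = len(label_ids)
    return torch.tensor([total / (num_labels * counts[i]) for i in range(num_labels)])

# Tokenization (assuming AutoTokenizer.from_pretrained("BAAI/bge-m3")):
# enc = tokenizer(texts, padding=True, truncation=True,
#                 max_length=512, return_tensors="pt")
```

Rare codes then contribute more to the loss than frequent ones. Would this be a reasonable way to deal with the class imbalance, or is oversampling preferred?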
What I've tried so far:

- Loaded the pre-trained BGE M3-Embedding model and tokenizer with Hugging Face Transformers.
- Attempted to add a custom classification head on top of the embedding model.
- Ran into errors in my code and got model outputs that don't look right.
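The training loop I attempted boils down to a step like the following (a sketch; `train_step` is my own name, and `model` is assumed to take `input_ids`/`attention_mask` and return logits):

```python
import torch
import torch.nn as nn

def train_step(model, batch, optimizer, loss_fn):
    """One optimization step: forward, loss, backward, update."""
    model.train()
    optimizer.zero_grad()
    logits = model(batch["input_ids"], batch["attention_mask"])
    loss = loss_fn(logits, batch["labels"])
    loss.backward()
    optimizer.step()
    return loss.item()

# Intended usage, iterating over a DataLoader of tokenized batches:
# optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
# loss_fn = nn.CrossEntropyLoss()  # optionally with weight= for imbalance
# for batch in train_loader:
#     loss = train_step(model, batch, optimizer, loss_fn)
```

Does this structure look correct, or am I missing something that the Trainer API would normally handle for me?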
Any advice, code snippets, or resources would be greatly appreciated!

Thank you in advance for your help!

falloutats changed discussion title from embossings and need for fine-tuning to embeddings and need for fine-tuning
