How to Fine-Tune Custom Embedding Models Using AutoTrain
AutoTrain is a powerful no-code tool that allows you to train or fine-tune many different state-of-the-art models, including Sentence Transformer models, on your own datasets with ease. Here’s a simple guide to get you started with fine-tuning custom embedding models using AutoTrain.
Types of Sentence Transformer Fine-Tuning in AutoTrain
AutoTrain supports various types of sentence transformer fine-tuning tasks:
- `pair`: Dataset with two sentences: `anchor` and `positive`.
- `pair_class`: Dataset with two sentences: `premise` and `hypothesis`, with a target `label`.
- `pair_score`: Dataset with two sentences: `sentence1` and `sentence2`, with a target `score`.
- `triplet`: Dataset with three sentences: `anchor`, `positive`, and `negative`.
- `qa`: Dataset with two sentences: `query` and `answer`.
Data Format
AutoTrain accepts data in CSV or JSONL format. You can also use a dataset from the Hugging Face Hub. Let’s look at the format for each task.
`pair`:

| anchor | positive |
|---|---|
| hello | hi |
| how are you | I am fine |
| What is your name? | My name is Abhishek |
| Which is the best programming language? | Python |
`pair_class`:

| premise | hypothesis | label |
|---|---|---|
| hello | hi | 1 |
| how are you | I am fine | 0 |
| What is your name? | My name is Abhishek | 1 |
| Which is the best programming language? | Python | 1 |
`pair_score`:

| sentence1 | sentence2 | score |
|---|---|---|
| hello | hi | 0.8 |
| how are you | I am fine | 0.2 |
| What is your name? | My name is Abhishek | 0.9 |
| Which is the best programming language? | Python | 0.7 |
`triplet`:

| anchor | positive | negative |
|---|---|---|
| hello | hi | bye |
| how are you | I am fine | I am not fine |
| What is your name? | My name is Abhishek | Whats it to you? |
| Which is the best programming language? | Python | Javascript |
`qa`:

| query | answer |
|---|---|
| hello | hi |
| how are you | I am fine |
| What is your name? | My name is Abhishek |
| Which is the best programming language? | Python |
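In JSONL form, each row becomes one JSON object whose keys are the column names. As a sketch, the `triplet` table above would look like this, one object per line:

```json
{"anchor": "hello", "positive": "hi", "negative": "bye"}
{"anchor": "how are you", "positive": "I am fine", "negative": "I am not fine"}
{"anchor": "What is your name?", "positive": "My name is Abhishek", "negative": "Whats it to you?"}
{"anchor": "Which is the best programming language?", "positive": "Python", "negative": "Javascript"}
```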
Column Mapping
Column mapping is crucial for AutoTrain to understand the role of each column in your dataset. Here’s how you can map columns for each task:
| Task | Column Mapping |
|---|---|
| `pair` | `{"sentence1_column": "anchor", "sentence2_column": "positive"}` |
| `pair_class` | `{"sentence1_column": "premise", "sentence2_column": "hypothesis", "target_column": "label"}` |
| `pair_score` | `{"sentence1_column": "sentence1", "sentence2_column": "sentence2", "target_column": "score"}` |
| `triplet` | `{"sentence1_column": "anchor", "sentence2_column": "positive", "sentence3_column": "negative"}` |
| `qa` | `{"sentence1_column": "query", "sentence2_column": "answer"}` |
Tips for Accurate Mapping
- Verify Column Names: Ensure that the names used in the mapping dictionary match those in your dataset (a quick programmatic check is sketched after this list).
- Update Mappings for New Datasets: Each dataset might need unique mappings based on its structure.
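Before launching a job, it can help to check programmatically that every column referenced in your mapping actually exists in your dataset. Here is a minimal sketch using pandas; the file name and mapping are illustrative:

```python
import pandas as pd

# Illustrative file name and mapping -- adjust to your own dataset
df = pd.read_csv("train.csv")
column_mapping = {
    "sentence1_column": "anchor",
    "sentence2_column": "positive",
    "sentence3_column": "negative",
}

# Report any mapped column that is missing from the dataset
missing = [col for col in column_mapping.values() if col not in df.columns]
if missing:
    raise ValueError(f"Columns missing from dataset: {missing}")
print("All mapped columns are present.")
```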
By following these steps and ensuring proper data formatting and column mapping, you can effectively fine-tune custom embedding models using AutoTrain.
Fine-Tuning Using CLI with AutoTrain
Once you have your data and column mappings set up, you can proceed to fine-tune your Sentence Transformer model using the AutoTrain CLI. Below is an example configuration for a `triplet` task.
Configuration File (`config.yml`)

Create a `config.yml` file with the following content:
```yaml
task: sentence-transformers:triplet
base_model: microsoft/mpnet-base
project_name: autotrain-st-triplet
log: tensorboard
backend: local

data:
  path: sentence-transformers/all-nli
  train_split: triplet:train
  valid_split: triplet:dev
  column_mapping:
    sentence1_column: anchor
    sentence2_column: positive
    sentence3_column: negative

params:
  max_seq_length: 512
  epochs: 5
  batch_size: 8
  lr: 2e-5
  optimizer: adamw_torch
  scheduler: linear
  gradient_accumulation: 1
  mixed_precision: fp16

hub:
  username: ${HF_USERNAME}
  token: ${HF_TOKEN}
  push_to_hub: true
```
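The `hub` section reads your Hugging Face username and a write-enabled token from environment variables, so export them in your shell before launching training (placeholder values shown):

```bash
export HF_USERNAME=your-hf-username
export HF_TOKEN=your-hf-write-token
```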
Steps to Fine-Tune Using CLI
1. Install AutoTrain-Advanced: Ensure you have `autotrain-advanced` installed. You can install it via pip:

```bash
pip install autotrain-advanced
```

2. Prepare Configuration: Save the above YAML configuration as `config.yml`.

3. Run Training: Execute the following command to start the fine-tuning process:

```bash
autotrain --config config.yml
```
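Once training completes, the model is pushed to the Hugging Face Hub (because `push_to_hub: true`), and you can load it like any other Sentence Transformer. A minimal usage sketch, assuming the project above ends up under your username as `autotrain-st-triplet`:

```python
from sentence_transformers import SentenceTransformer

# Illustrative repo id: <your-username>/<project_name> from the config above
model = SentenceTransformer("your-username/autotrain-st-triplet")

# Encode sentences into dense embedding vectors
embeddings = model.encode(["hello", "hi", "bye"])
print(embeddings.shape)  # (3, embedding_dim)
```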
Configuration for Different Tasks
Here’s how the `config.yml` file will change for different tasks. Only the fields shown below differ; the rest of the configuration stays the same as in the triplet example.
Pair Task
```yaml
task: sentence-transformers:pair
data:
  column_mapping:
    sentence1_column: anchor
    sentence2_column: positive
```
Pair Class Task
```yaml
task: sentence-transformers:pair_class
data:
  column_mapping:
    sentence1_column: premise
    sentence2_column: hypothesis
    target_column: label
```
Pair Score Task
```yaml
task: sentence-transformers:pair_score
data:
  column_mapping:
    sentence1_column: sentence1
    sentence2_column: sentence2
    target_column: score
```
QA Task
```yaml
task: sentence-transformers:qa
data:
  column_mapping:
    sentence1_column: query
    sentence2_column: answer
```
Please note that each task requires a different type of dataset, and the column mappings will change depending on the column names available in your dataset. Here’s an example dataset covering most of the trainer types above.
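To make the pattern concrete, here is a sketch of what a complete `pair_class` configuration could look like, combining the snippet above with the shared fields from the triplet example. The project name, data path, and splits are illustrative assumptions, not values from a real dataset:

```yaml
task: sentence-transformers:pair_class
base_model: microsoft/mpnet-base
project_name: autotrain-st-pair-class  # illustrative project name
log: tensorboard
backend: local

data:
  path: data/  # illustrative: local folder containing train.csv
  train_split: train
  valid_split: null
  column_mapping:
    sentence1_column: premise
    sentence2_column: hypothesis
    target_column: label

params:
  max_seq_length: 512
  epochs: 5
  batch_size: 8
  lr: 2e-5
  optimizer: adamw_torch
  scheduler: linear
  gradient_accumulation: 1
  mixed_precision: fp16

hub:
  username: ${HF_USERNAME}
  token: ${HF_TOKEN}
  push_to_hub: true
```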
Train on Hugging Face Spaces via AutoTrain UI
If you want to train on Hugging Face Spaces using the AutoTrain UI, you can click here to create the space, attach the appropriate hardware, and choose the correct task in the resulting UI. Clicking on Start Training will start the training process :)
P.S. You can do the same on Google Colab! Check out the GitHub Repo to learn more.
Summary
By following these steps and ensuring proper data formatting and column mapping, you can effectively fine-tune custom embedding models using AutoTrain, whether through the web interface or the command-line interface. With a straightforward configuration file, you can leverage AutoTrain to train powerful Sentence Transformer models tailored to your specific needs.
Happy training!