SetFit with BAAI/bge-small-en-v1.5

This is a SetFit model for text classification. It uses BAAI/bge-small-en-v1.5 as the Sentence Transformer embedding model and a LogisticRegression instance as the classification head.

The model has been trained using an efficient few-shot learning technique that involves:

1. Fine-tuning a Sentence Transformer with contrastive learning.
2. Training a classification head with features from the fine-tuned Sentence Transformer.
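
The contrastive step trains on sentence pairs rather than on single sentences. The following is a minimal sketch of how such pairs could be built from labeled examples, assuming the common convention that same-label pairs get a target similarity of 1.0 and different-label pairs get 0.0. The function name `make_pairs` is hypothetical and not part of the SetFit API; the library itself also samples and balances the pairs, which this sketch does not attempt.

```python
from itertools import combinations


def make_pairs(examples):
    """Build (text_a, text_b, target) triples from (text, label) examples.

    target is 1.0 when both texts share a label, otherwise 0.0.
    """
    pairs = []
    for (text_a, label_a), (text_b, label_b) in combinations(examples, 2):
        target = 1.0 if label_a == label_b else 0.0
        pairs.append((text_a, text_b, target))
    return pairs


# Sample utterances taken from the Model Labels table in this card.
examples = [
    ("Provide data_asset_kpi_cf group by quarter.", "Aggregation"),
    ("Get me data_asset_001_kpm group by metrics.", "Aggregation"),
    ("Join data_asset_kpi_cf with data_asset_001_kpm tables.", "Tablejoin"),
    ("can I have kpm table", "Lookup_1"),
]

pairs = make_pairs(examples)
for text_a, text_b, target in pairs:
    print(f"{target:.1f}  {text_a!r}  <->  {text_b!r}")
```

Four examples yield six pairs, only one of which is a same-label pair. This shows why a handful of labeled sentences can still produce a sizeable contrastive training set.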

Model Details

Model Description

Model Type: SetFit
Sentence Transformer body: BAAI/bge-small-en-v1.5
Classification head: a LogisticRegression instance
Maximum Sequence Length: 512 tokens
Number of Classes: 7

Model Sources

Repository: SetFit on GitHub
Paper: Efficient Few-Shot Learning Without Prompts
Blogpost: SetFit: Efficient Few-Shot Learning Without Prompts

Model Labels

Label	Examples
Aggregation	'Please show med CostVariance_Actual_vs_Forecast.' 'Get me data_asset_001_kpm group by metrics.' 'Provide data_asset_kpi_cf group by quarter.'
Tablejoin	'Join data_asset_kpi_cf with data_asset_001_kpm tables.' 'Could you link the Products and Orders tables to track sales trends for different product categories?' 'Can I have a merge of income statement and key performance metrics tables?'
Lookup	"Filter by the 'Sales' department and show me the employees." "Filter by the 'Toys' category and get me the product names." 'Can you get me the products with a price above 100?'
Rejection	"Let's avoid generating additional reports." "I'd rather not filter this dataset." "I'd prefer not to apply any filters."
Lookup_1	'Show me key income statement metrics.' 'can I have kpm table' 'Retrieve data_asset_kpi_ma_product records.'
Generalreply	"Hey! It's going pretty well, thanks for asking. How about yours?" 'Not much, just taking it one day at a time. How about you?' "'What is your favorite quote?'"
Viewtables	'What are the table names that relate to customer service in the starhub_data_asset database?' 'What tables are available in the starhub_data_asset database that can be joined to track user behavior?' 'What are the tables that are available for analysis in the starhub_data_asset database?'

Evaluation

Metrics

Label	Accuracy
all	0.9915

Uses

Direct Use for Inference

First install the SetFit library:

pip install setfit

Then you can load this model and run inference.

from setfit import SetFitModel

# Download from the 🤗 Hub
model = SetFitModel.from_pretrained("nazhan/bge-small-en-v1.5-brahmaputra-iter-10-3rd")
# Run inference
preds = model("Show me average asset value.")

Training Details

Training Set Metrics

Training set	Min	Median	Max
Word count	1	8.7839	62

Label	Training Sample Count
Tablejoin	127
Rejection	76
Aggregation	281
Lookup	59
Generalreply	71
Viewtables	75
Lookup_1	158
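
The label counts are uneven. The short sketch below uses only the numbers from the table above to compute the total and each label's share; the variable names are illustrative.

```python
# Training sample counts copied from the table above.
counts = {
    "Tablejoin": 127,
    "Rejection": 76,
    "Aggregation": 281,
    "Lookup": 59,
    "Generalreply": 71,
    "Viewtables": 75,
    "Lookup_1": 158,
}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}
largest = max(counts, key=counts.get)
smallest = min(counts, key=counts.get)
imbalance_ratio = counts[largest] / counts[smallest]

print(f"total samples: {total}")
for label, n in sorted(counts.items(), key=lambda kv: -kv[1]):
    print(f"{label:<13}{n:>4}  {shares[label]:.1%}")
print(f"{largest} has {imbalance_ratio:.1f}x the samples of {smallest}")
```

Aggregation accounts for roughly a third of the 847 training samples, while Lookup has the fewest.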

Training Hyperparameters

batch_size: (16, 16)
num_epochs: (1, 1)
max_steps: 2450
sampling_strategy: oversampling
body_learning_rate: (2e-05, 1e-05)
head_learning_rate: 0.01
loss: CosineSimilarityLoss
distance_metric: cosine_distance
margin: 0.25
end_to_end: False
use_amp: False
warmup_proportion: 0.1
seed: 42
eval_max_steps: -1
load_best_model_at_end: True

Training Results

Epoch	Step	Training Loss	Validation Loss
0.0000	1	0.2317	-
0.0025	50	0.2478	-
0.0050	100	0.2213	-
0.0075	150	0.0779	-
0.0100	200	0.1089	-
0.0125	250	0.0372	-
0.0149	300	0.0219	-
0.0174	350	0.0344	-
0.0199	400	0.012	-
0.0224	450	0.0049	-
0.0249	500	0.0041	-
0.0274	550	0.0083	-
0.0299	600	0.0057	-
0.0324	650	0.0047	-
0.0349	700	0.0022	-
0.0374	750	0.0015	-
0.0399	800	0.0032	-
0.0423	850	0.002	-
0.0448	900	0.0028	-
0.0473	950	0.0017	-
0.0498	1000	0.0017	-
0.0523	1050	0.0027	-
0.0548	1100	0.0022	-
0.0573	1150	0.0018	-
0.0598	1200	0.001	-
0.0623	1250	0.002	-
0.0648	1300	0.001	-
0.0673	1350	0.0013	-
0.0697	1400	0.0012	-
0.0722	1450	0.0018	-
0.0747	1500	0.0012	-
0.0772	1550	0.0016	-
0.0797	1600	0.0012	-
0.0822	1650	0.0016	-
0.0847	1700	0.0027	-
0.0872	1750	0.0014	-
0.0897	1800	0.0011	-
0.0922	1850	0.0011	-
0.0947	1900	0.0012	-
0.0971	1950	0.0014	-
0.0996	2000	0.0014	-
0.1021	2050	0.0015	-
0.1046	2100	0.0009	-
0.1071	2150	0.0015	-
0.1096	2200	0.0013	-
0.1121	2250	0.0013	-
0.1146	2300	0.001	-
0.1171	2350	0.0017	-
0.1196	2400	0.0013	-
0.1221	2450	0.0008	0.0323

The bold row denotes the saved checkpoint.
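
The Epoch column only reaches 0.1221 because max_steps: 2450 stops training well before one full pass over the generated pairs. Below is a rough back-of-the-envelope estimate from the last logged row. It assumes that the Epoch value is simply the step divided by the steps per epoch, and that each step consumes one batch of 16 pairs (the first entry of batch_size, which applies to the embedding phase). The logged epochs are rounded to four decimals, so the derived figures are approximate.

```python
# Values from the final row of the table and the hyperparameters above.
final_step = 2450
final_epoch = 0.1221   # rounded in the log, so derived figures are approximate
batch_size = 16        # embedding-phase batch size (assumed to apply per step)

steps_per_epoch = final_step / final_epoch
pairs_per_epoch = steps_per_epoch * batch_size
pairs_seen = final_step * batch_size

print(f"approx. steps per epoch: {steps_per_epoch:,.0f}")
print(f"approx. pairs per epoch: {pairs_per_epoch:,.0f}")
print(f"pairs seen in training:  {pairs_seen:,}")
```

Under these assumptions, training saw 39,200 pairs, which is about 12% of the roughly 320,000 pairs in a full epoch.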

Framework Versions

Python: 3.11.9
SetFit: 1.0.3
Sentence Transformers: 2.7.0
Transformers: 4.42.4
PyTorch: 2.4.0+cu121
Datasets: 2.21.0
Tokenizers: 0.19.1

Citation

BibTeX

@article{https://doi.org/10.48550/arxiv.2209.11055,
    doi = {10.48550/ARXIV.2209.11055},
    url = {https://arxiv.org/abs/2209.11055},
    author = {Tunstall, Lewis and Reimers, Nils and Jo, Unso Eun Seo and Bates, Luke and Korat, Daniel and Wasserblat, Moshe and Pereg, Oren},
    keywords = {Computation and Language (cs.CL), FOS: Computer and information sciences},
    title = {Efficient Few-Shot Learning Without Prompts},
    publisher = {arXiv},
    year = {2022},
    copyright = {Creative Commons Attribution 4.0 International}
}
