How to use a transformer to match a text paragraph to existing category descriptions?
Input (given as a sentence/paragraph) = “an application that convert speech to text using AI”
Desired output:
category 1 match = 0.1
category 2 match = 0.9
…
Category 1 description = “medical equipment includes wearable devices”
Category 2 description = “software for SaaS or downloadable”
…
In other words, I am trying to get Hugging Face to learn the descriptions of various categories, which are given as sentences or item lists.
Then I want it to classify a paragraph description (my input) into the categories that best match that input.
These are the adjustments I picked for matching a text paragraph to existing category descriptions:
learning_rate=5e-5
weight_decay=0.01
evaluation_strategy="epoch"
To use the HF Transformers library (plus the datasets library, which the data-loading step below relies on), install it first:
pip install transformers datasets
Then clean the data into labeled examples, where each category gets an integer label, for example:
[
  {
    "text": "an application that convert speech to text using AI",
    "label": 2
  },
  {
    "text": "medical equipment includes wearable devices",
    "label": 1
  },
  {
    "text": "software for SaaS or downloadable",
    "label": 2
  }
]
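The fine-tuning script below expects train_dataset and val_dataset as Hugging Face Dataset objects. Here is a minimal sketch of building them, assuming the labeled examples above are saved as a JSON array in a file called data.json (a hypothetical filename) and using an 80/20 train/validation split:

from datasets import load_dataset

# Load the labeled examples and split off 20% for validation
dataset = load_dataset("json", data_files="data.json", split="train")
dataset = dataset.train_test_split(test_size=0.2, seed=42)
train_dataset = dataset["train"]
val_dataset = dataset["test"]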
Use a script like the one below:
from transformers import BertTokenizer, BertForSequenceClassification, TrainingArguments, Trainer

# Load the pretrained BERT tokenizer and model
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
# num_labels must match the number of categories you defined
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=3)

def tokenize(batch):
    return tokenizer(batch["text"], padding=True, truncation=True, max_length=512)

# train_dataset and val_dataset are the Dataset objects prepared above
train_dataset = train_dataset.map(tokenize, batched=True, batch_size=None)
val_dataset = val_dataset.map(tokenize, batched=True, batch_size=None)
training_args = TrainingArguments(
    output_dir="results",
    num_train_epochs=5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    logging_dir="logs",
    logging_steps=100,
    save_steps=1000,
    evaluation_strategy="epoch",
    learning_rate=5e-5,
    weight_decay=0.01,
)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    tokenizer=tokenizer,  # so trainer.save_model() also saves the tokenizer
)
trainer.train()
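If you want to check validation performance after training, the Trainer has a built-in evaluation call:

metrics = trainer.evaluate()
print(metrics)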
Save the fine-tuned model:
trainer.save_model("fine-tuned-model")
# Load the fine-tuned model
tokenizer = BertTokenizer.from_pretrained("fine-tuned-model")
model = BertForSequenceClassification.from_pretrained("fine-tuned-model")
Use it for inference:
import torch

def predict(text):
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():  # gradients are not needed for inference
        outputs = model(**inputs)
    logits = outputs.logits
    category = logits.argmax(dim=-1).item()
    return category
text = "an application that convert speech to text using AI"
category = predict(text)
print(f"Category match: {category}")
In case that's no bueno for your case, here's another good set of parameters you can try:
# Set training arguments
training_args = TrainingArguments(
    output_dir="results",
    num_train_epochs=4,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    logging_dir="logs",
    logging_steps=1000,
    save_steps=5000,
    evaluation_strategy="steps",
    learning_rate=2e-5,
    weight_decay=0.01,
    warmup_steps=500,
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",  # requires a compute_metrics function on the Trainer
    greater_is_better=True,
)
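Note that metric_for_best_model="accuracy" only works if the Trainer is given a function that reports an accuracy metric. A minimal sketch (the function body is my own example of the standard compute_metrics pattern):

import numpy as np

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return {"accuracy": (predictions == labels).mean()}

# Pass it when constructing the Trainer: Trainer(..., compute_metrics=compute_metrics)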