DistilBERT encoder model trained on the European law document tagging dataset (EURLex-4K) with the DEXML method (Dual Encoders for eXtreme Multi-label Classification, ICLR'24), using cross-batch mixed negative sampling.
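As rough intuition for the training setup, below is a generic sketch of in-batch negative sampling for a dual encoder: each label embedding in the batch acts as a negative for every other document. This is *not* the exact DEXML objective (the paper proposes its own loss; see the citation below), and all names here are illustrative.

```python
import torch
import torch.nn.functional as F

def in_batch_contrastive_loss(doc_emb, label_emb, temperature=0.05):
    # doc_emb, label_emb: (batch, dim); row i of label_emb is the positive
    # label embedding for document i, all other rows act as negatives.
    doc_emb = F.normalize(doc_emb, dim=-1)
    label_emb = F.normalize(label_emb, dim=-1)
    logits = doc_emb @ label_emb.T / temperature  # (batch, batch) similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    return F.cross_entropy(logits, targets)
```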
## Inference Usage (Sentence-Transformers)
With `sentence-transformers` installed, you can use this model as follows:
```python
from sentence_transformers import SentenceTransformer

sentences = ["This is an example sentence", "Each sentence is converted"]

model = SentenceTransformer('quicktensor/dexml_eurlex-4k')
embeddings = model.encode(sentences)
print(embeddings)
```
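Since this is a dual encoder for extreme multi-label classification, documents and label texts are embedded in the same space, so labels can be ranked by cosine similarity. A minimal sketch, where the document and candidate label strings are made up for illustration and are not the real EURLex-4K label set:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('quicktensor/dexml_eurlex-4k')

# Hypothetical document and candidate labels, purely illustrative
doc = "Regulation laying down rules on the import of agricultural products"
labels = ["agricultural product", "import policy", "data protection"]

doc_emb = model.encode(doc, convert_to_tensor=True)
label_emb = model.encode(labels, convert_to_tensor=True)
scores = util.cos_sim(doc_emb, label_emb)  # (1, num_labels) similarities
print(sorted(zip(labels, scores[0].tolist()), key=lambda p: -p[1]))
```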
## Usage (HuggingFace Transformers)
With Hugging Face `transformers`, you only need to be careful about how you pool the transformer output to get the embedding. You can use this model as follows:
```python
from transformers import AutoTokenizer, AutoModel
import torch
import torch.nn.functional as F

# Take the CLS token embedding and L2-normalize it
pooler = lambda last_hidden: F.normalize(last_hidden[:, 0, :], dim=-1)

sentences = ["This is an example sentence", "Each sentence is converted"]

tokenizer = AutoTokenizer.from_pretrained('quicktensor/dexml_eurlex-4k')
model = AutoModel.from_pretrained('quicktensor/dexml_eurlex-4k')

encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
with torch.no_grad():
    embeddings = pooler(model(**encoded_input).last_hidden_state)
print(embeddings)
```
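Because the pooler already L2-normalizes the CLS embedding, cosine similarity reduces to a plain dot product. Continuing from the snippet above:

```python
# embeddings are already unit-norm, so a matrix product gives cosine similarities
scores = embeddings @ embeddings.T
print(scores)  # pairwise similarities of the two example sentences
```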
## Cite the original authors
If you found this model helpful, please cite the original work as:
```bibtex
@InProceedings{DEXML,
  author    = "Gupta, N. and Khatri, D. and Rawat, A-S. and Bhojanapalli, S. and Jain, P. and Dhillon, I.",
  title     = "Dual-encoders for Extreme Multi-label Classification",
  booktitle = "International Conference on Learning Representations",
  month     = "May",
  year      = "2024"
}
```