Version 1.0 Announcement

#1
by scampion - opened
European Parliament org
edited Jun 3

📢 Announcement: Eurovoc Multilabel & Multilanguage Classifier Release! 🇪🇺

🌟 Introduction: We are delighted to introduce Eurovoc's Multilabel Multilingual Classifier, a tool trained extensively on the EuroHPC Meluxina cluster. It is designed to classify institutional documents in the 24 EU languages. In short, it extracts keywords restricted to the Eurovoc vocabulary from text documents.

🚀 Technical Marvel: Leveraging the BERT Deep Neural Network and trained on a significant corpus of 3.2 million documents, this model demonstrates efficient processing on standard hardware, managing diverse language tasks with quick response times.

⚙️ Under the Hood: Built on the EUBERT architecture, this model has a lean structure of fewer than 100 million parameters, so it can be deployed quickly without GPU acceleration.
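To make "multilabel" concrete, here is a minimal sketch of how per-label scores are commonly turned into a set of keywords: each label gets an independent sigmoid probability, and every label above a threshold is kept. The function name, the threshold, the `top_k` cut-off and the example labels are illustrative assumptions, not details taken from the released model.

```python
import math

def decode_multilabel(logits: dict[str, float],
                      threshold: float = 0.5,
                      top_k: int = 5) -> list[tuple[str, float]]:
    """Turn raw per-label scores into (label, probability) pairs.

    Assumption: each label is scored independently with a sigmoid,
    the usual setup for multilabel classification heads.
    """
    # Independent sigmoid per label (no softmax: labels do not compete).
    scored = [(label, 1.0 / (1.0 + math.exp(-z))) for label, z in logits.items()]
    # Keep labels whose probability reaches the threshold.
    kept = [(label, p) for label, p in scored if p >= threshold]
    # Most confident labels first, truncated to top_k.
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:top_k]

# Hypothetical scores for three descriptors.
example_logits = {
    "fisheries policy": 2.0,
    "data protection": -1.0,
    "common agricultural policy": 0.5,
}
predicted = decode_multilabel(example_logits)
print([label for label, _ in predicted])
```

A document can therefore receive zero, one or many descriptors, which is why ranking metrics such as NDCG are reported alongside F1.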

📊 Performance Metrics: The model outperforms prior benchmarks, achieving a Micro F1 score of 0.8188 on the Eurovoc Dataset version 23.08 and setting new standards in legal text classification.
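For readers unfamiliar with the metric, the sketch below shows the standard definition of Micro F1 for multilabel output: true positives, false positives and false negatives are pooled over all documents before the score is computed. The function name and the toy labels are illustrative assumptions; this is not the evaluation script behind the figure above.

```python
def micro_f1(true_labels: list[set[str]], pred_labels: list[set[str]]) -> float:
    """Micro-averaged F1 over a list of documents with label sets."""
    tp = fp = fn = 0
    for truth, pred in zip(true_labels, pred_labels):
        tp += len(truth & pred)   # labels correctly predicted
        fp += len(pred - truth)   # labels predicted but not in the reference
        fn += len(truth - pred)   # reference labels that were missed
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

# Two toy documents with hypothetical descriptors.
truth = [{"fisheries policy", "marine environment"}, {"data protection"}]
preds = [{"fisheries policy"}, {"data protection", "internet"}]
print(round(micro_f1(truth, preds), 4))  # 2 TP, 1 FP, 1 FN -> 0.6667
```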

🔍 Usage: The model is accessible through a Hugging Face inference endpoint for European Parliament teams and integrates into production environments for multilingual classification tasks. It is also published as open source.

📚 Notable achievements: Compared with other publications in the field, the measured scores set a new state of the art.

🌐 Community Engagement: Embracing the spirit of collaboration, we encourage community feedback and contributions to enhance this tool’s capabilities for diverse language tasks.

For more information: https://huggingface.co/EuropeanParliament/eurovoc_eu

The Eurovoc Multilabel Classifier marks a step forward in our efforts to contribute to multilingual legal document analysis. We're committed to refining and improving this tool while remaining open to community input to better serve the diverse linguistic needs of the EU.

This model can be tested here: https://huggingface.co/spaces/EuropeanParliament/Eurovoc (select multilingual)

☝️ Members of the European Parliament can use it directly, without deployment, thanks to the API available at the following endpoint
https://wu0h9yuxkbna7e7c.eu-west-1.aws.endpoints.huggingface.cloud
hosted by AWS ☁️ in Ireland 🇮🇪.
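As a rough idea of what a call could look like, here is a minimal sketch using only the Python standard library. The `{"inputs": ...}` body and the bearer-token header follow the usual Hugging Face Inference Endpoints convention, but both are assumptions here: the exact payload and response format depend on the handler deployed behind this endpoint. The function names and the `HF_TOKEN` variable are hypothetical.

```python
import json
import urllib.request

ENDPOINT_URL = "https://wu0h9yuxkbna7e7c.eu-west-1.aws.endpoints.huggingface.cloud"

def build_request(text: str, token: str) -> urllib.request.Request:
    """Prepare (but do not send) a POST request for the endpoint."""
    # Assumption: the endpoint accepts the usual {"inputs": ...} JSON body.
    body = json.dumps({"inputs": text}).encode("utf-8")
    headers = {
        "Authorization": f"Bearer {token}",  # assumption: bearer-token auth
        "Content-Type": "application/json",
    }
    return urllib.request.Request(ENDPOINT_URL, data=body, headers=headers, method="POST")

def classify(text: str, token: str, timeout: float = 30.0):
    """Send the request and return the decoded JSON response unchanged."""
    with urllib.request.urlopen(build_request(text, token), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))

# Usage (requires network access and a valid token):
#   import os
#   print(classify("Regulation on the common fisheries policy", os.environ["HF_TOKEN"]))
```

Separating `build_request` from `classify` makes it possible to inspect the request before anything is sent over the network.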

To finish, below is a quick benchmark based on 645 documents (mostly in English, with some in French and German) published by the Publications Office in September and October, and therefore unseen by the model.

| Metric | PyEuroVoc | Initial PoC | Legal BERT 🇬🇧 23.08 | EUBERT 🇪🇺 23.08 |
|---|---|---|---|---|
| NDCG@3 | 0.5013 | 0.5239 | 0.7071 | 0.7292 |
| NDCG@5 | 0.4325 | 0.4583 | 0.6353 | 0.6655 |
| NDCG@10 | 0.3891 | 0.4253 | 0.5863 | 0.6137 |
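For anyone who wants to reproduce this kind of figure on their own documents, here is a minimal sketch of NDCG@k. It assumes binary relevance (a predicted descriptor is either in the reference set or not) and the common log2 discount; the post does not state which variant was used for the benchmark, so treat this as one plausible definition. The function name and toy labels are hypothetical.

```python
import math

def ndcg_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """NDCG@k with binary relevance for one document.

    ranked:   predicted labels, most confident first
    relevant: reference labels for the document
    """
    # Discounted gain of the relevant labels found in the top k positions.
    dcg = sum(1.0 / math.log2(i + 2)
              for i, label in enumerate(ranked[:k]) if label in relevant)
    # Ideal ranking: all relevant labels first, limited to k positions.
    ideal_hits = min(k, len(relevant))
    idcg = sum(1.0 / math.log2(i + 2) for i in range(ideal_hits))
    return dcg / idcg if idcg else 0.0

ranked = ["fisheries policy", "internet", "marine environment"]
relevant = {"fisheries policy", "marine environment"}
print(round(ndcg_at_k(ranked, relevant, k=3), 4))
```

The corpus-level score would then be the mean of `ndcg_at_k` over all documents.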

Your comments, patches, benchmark results and any other contributions are obviously welcome, so that together we can improve an open source solution for our users.

European Parliament org

Our latest evaluation on documents published between September and November 2023 (approximately 45,000 documents):
F1 score: 0.5853
NDCG@3: 0.8101
NDCG@5: 0.7429
NDCG@10: 0.6936

European Parliament org

Our latest evaluation with PyEuroVoc on documents published between September and November 2023 (approximately 45,000 documents):
F1 score: 0.2972
NDCG@3: 0.5012
NDCG@5: 0.4422
NDCG@10: 0.4106
