nikitast/lang-segmentation-roberta

Token Classification

language classification

text segmentation

RoBERTa for Multilabel Language Segmentation

Training

RoBERTa fine-tuned on small subsets of the Open Subtitles, OSCAR, and Tatoeba datasets (~9k samples per language).

The multilingual training data and its target masks were generated with a heuristic algorithm; the implementation is available at https://github.com/n1kstep/lang-classifier

Data source	Languages
open_subtitles	ka, he, en, de
oscar	be, kk, az, hu
tatoeba	ru, uk
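
As a rough illustration of what generating multilingual samples with target masks can look like, the sketch below joins monolingual sentences from different languages and records one language label per word. This is an assumption-laden toy, not the algorithm from the linked repository: the function name `make_mixed_sample`, the whitespace tokenization, and the tiny example corpus are all hypothetical.

```python
import random


def make_mixed_sample(corpus, n_segments=2, rng=None):
    """Build one mixed-language sample and its per-word target mask.

    corpus: dict mapping a language code to a list of sentences (hypothetical input format).
    Returns (words, mask), where mask[i] is the language code of words[i].
    """
    rng = rng or random.Random()
    # Pick distinct languages for the segments of this sample.
    langs = rng.sample(sorted(corpus), k=n_segments)
    words, mask = [], []
    for lang in langs:
        sentence = rng.choice(corpus[lang])
        tokens = sentence.split()  # naive whitespace split, for illustration only
        words.extend(tokens)
        mask.extend([lang] * len(tokens))
    return words, mask


# Toy corpus; the real training data comes from the sources listed above.
corpus = {
    "en": ["the weather is nice today"],
    "de": ["ich habe keine Zeit"],
    "ru": ["я люблю читать книги"],
}

words, mask = make_mixed_sample(corpus, n_segments=2, rng=random.Random(0))
for word, lang in zip(words, mask):
    print(f"{word}\t{lang}")
```

A token-classification model would then be trained to predict the mask from the words, after aligning word-level labels to the tokenizer's subword units.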

Validation

The metrics below were obtained by validating on a separate part of the dataset (~1k samples per language).

Validation Loss	Precision	Recall	F1-Score	Accuracy
0.029172	0.919623	0.933586	0.926552	0.991883
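
As a quick sanity check, and assuming the F1-Score is the harmonic mean of the reported precision and recall (the card does not state how the scores were averaged), the table values can be cross-checked like this:

```python
# Values copied from the validation table above.
precision = 0.919623
recall = 0.933586
reported_f1 = 0.926552

# F1 as the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

print(f"computed F1 = {f1:.6f}, reported F1 = {reported_f1:.6f}")
```

Under that assumption the computed value agrees with the reported one to six decimal places.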
