---
base_model: roberta-base
datasets:
- conll2003
language:
- en
library_name: span-marker
license: apache-2.0
metrics:
- precision
- recall
- f1
pipeline_tag: token-classification
tags:
- span-marker
- token-classification
- ner
- named-entity-recognition
- generated_from_span_marker_trainer
widget:
- text: '" The worst thing that could happen for financial markets is that if Clinton and Dole start to trade shots in the middle of the ring with one-upmanship, " said Hugh Johnson, chief investment officer at First Albany Corp. " That''s when Wall Street will need to worry . "'
- text: Poland revived diplomatic ties at ambassadorial level with Yugoslavia in April but economic links are almost moribund, despite the end of a three-year U.N. trade embargo imposed to punish Belgrade for its support of Bosnian Serbs.
- text: '" We believe that the Israeli settlement policy in the occupied areas is an obstacle to the establishment of peace, " German Foreign Ministry spokesman Martin Erdmann said.'
- text: U.S. Agriculture Department officials said Friday that Mexican avocados--which are restricted from entering the continental United States--will not likely be entering U.S. markets any time soon, even if the controversial ban were lifted today.
- text: 3. Tristan Hoffman (Netherlands) TVM same time
model-index:
- name: SpanMarker with roberta-base on conll2003
  results:
  - task:
      type: token-classification
      name: Named Entity Recognition
    dataset:
      name: Unknown
      type: conll2003
      split: test
    metrics:
    - type: f1
      value: 0.9022464022464022
      name: F1
    - type: precision
      value: 0.8943980514961726
      name: Precision
    - type: recall
      value: 0.9102337110481586
      name: Recall
---

# SpanMarker with roberta-base on conll2003

This is a [SpanMarker](https://github.com/tomaarsen/SpanMarkerNER) model trained on the [conll2003](https://huggingface.co/datasets/conll2003) dataset that can be used for Named Entity Recognition.
This SpanMarker model uses [roberta-base](https://huggingface.co/roberta-base) as the underlying encoder.

## Model Details

### Model Description
- **Model Type:** SpanMarker
- **Encoder:** [roberta-base](https://huggingface.co/roberta-base)
- **Maximum Sequence Length:** 256 tokens
- **Maximum Entity Length:** 6 words
- **Training Dataset:** [conll2003](https://huggingface.co/datasets/conll2003)
- **Language:** en
- **License:** apache-2.0

### Model Sources
- **Repository:** [SpanMarker on GitHub](https://github.com/tomaarsen/SpanMarkerNER)
- **Thesis:** [SpanMarker For Named Entity Recognition](https://raw.githubusercontent.com/tomaarsen/SpanMarkerNER/main/thesis.pdf)

### Model Labels
| Label | Examples                                                      |
|:------|:--------------------------------------------------------------|
| LOC   | "BRUSSELS", "Britain", "Germany"                              |
| MISC  | "British", "EU-wide", "German"                                |
| ORG   | "EU", "European Commission", "European Union"                 |
| PER   | "Werner Zwingmann", "Nikolaus van der Pas", "Peter Blackburn" |

## Evaluation

### Metrics
| Label   | Precision | Recall | F1     |
|:--------|:----------|:-------|:-------|
| **all** | 0.8944    | 0.9102 | 0.9022 |
| LOC     | 0.9220    | 0.9215 | 0.9217 |
| MISC    | 0.7332    | 0.7949 | 0.7628 |
| ORG     | 0.8764    | 0.8964 | 0.8863 |
| PER     | 0.9605    | 0.9629 | 0.9617 |

## Uses

### Direct Use for Inference

```python
from span_marker import SpanMarkerModel

# Download from the 🤗 Hub
model = SpanMarkerModel.from_pretrained("span_marker_model_id")
# Run inference
entities = model.predict("3. Tristan Hoffman (Netherlands) TVM same time")
```

### Downstream Use
You can finetune this model on your own dataset.

```python
from datasets import load_dataset
from span_marker import SpanMarkerModel, Trainer

# Download from the 🤗 Hub
model = SpanMarkerModel.from_pretrained("span_marker_model_id")

# Specify a Dataset with "tokens" and "ner_tags" columns
dataset = load_dataset("conll2003")  # For example CoNLL2003

# Initialize a Trainer using the pretrained model & dataset
trainer = Trainer(
    model=model,
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
)
trainer.train()
trainer.save_model("span_marker_model_id-finetuned")
```
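
The list returned by `model.predict` can be post-processed with plain Python. The sketch below assumes each prediction is a dictionary with `span`, `label`, and `score` keys (plus `char_start_index` and `char_end_index` for string input), which is the shape shown in the SpanMarker documentation. The helper name `group_entities`, the `min_score` threshold, and the sample predictions are hypothetical and hand-written for illustration; they are not actual output of this model.

```python
from collections import defaultdict


def group_entities(entities, min_score=0.5):
    """Group predicted spans by label, dropping predictions below min_score."""
    grouped = defaultdict(list)
    for entity in entities:
        if entity["score"] >= min_score:
            grouped[entity["label"]].append(entity["span"])
    return dict(grouped)


# Hypothetical predictions for illustration only (assumed output shape,
# made-up scores); real values come from model.predict(...).
sample_entities = [
    {"span": "Tristan Hoffman", "label": "PER", "score": 0.99,
     "char_start_index": 3, "char_end_index": 18},
    {"span": "Netherlands", "label": "LOC", "score": 0.98,
     "char_start_index": 20, "char_end_index": 31},
    {"span": "TVM", "label": "ORG", "score": 0.41,
     "char_start_index": 33, "char_end_index": 36},
]

if __name__ == "__main__":
    print(group_entities(sample_entities))
```

Filtering on `score` is one simple way to trade recall for precision at inference time; a suitable threshold depends on the application and is best chosen on held-out data.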
## Training Details

### Training Set Metrics
| Training set          | Min | Median  | Max |
|:----------------------|:----|:--------|:----|
| Sentence length       | 1   | 14.5019 | 113 |
| Entities per sentence | 0   | 1.6736  | 20  |

### Training Hyperparameters
- learning_rate: 1e-05
- train_batch_size: 4
- eval_batch_size: 4
- seed: 42
- gradient_accumulation_steps: 2
- total_train_batch_size: 8
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- lr_scheduler_warmup_ratio: 0.1
- num_epochs: 1
- mixed_precision_training: Native AMP

### Training Results
| Epoch  | Step | Validation Loss | Validation Precision | Validation Recall | Validation F1 | Validation Accuracy |
|:------:|:----:|:---------------:|:--------------------:|:-----------------:|:-------------:|:-------------------:|
| 0.2775 | 500  | 0.0282          | 0.9105               | 0.8355            | 0.8714        | 0.9670              |
| 0.5549 | 1000 | 0.0166          | 0.9215               | 0.9205            | 0.9210        | 0.9824              |
| 0.8324 | 1500 | 0.0151          | 0.9247               | 0.9346            | 0.9296        | 0.9853              |

### Framework Versions
- Python: 3.10.12
- SpanMarker: 1.5.0
- Transformers: 4.41.2
- PyTorch: 2.3.0+cu121
- Datasets: 2.20.0
- Tokenizers: 0.19.1

## Citation

### BibTeX
```
@software{Aarsen_SpanMarker,
    author = {Aarsen, Tom},
    license = {Apache-2.0},
    title = {{SpanMarker for Named Entity Recognition}},
    url = {https://github.com/tomaarsen/SpanMarkerNER}
}
```