DeepMount00 committed
Commit cc8997b • Parent: b1d7709

Upload 12 files
Browse files
- GLiNER/README.md +90 -0
- GLiNER/model.py +412 -0
- GLiNER/modules/base.py +150 -0
- GLiNER/modules/data_proc.py +73 -0
- GLiNER/modules/evaluator.py +152 -0
- GLiNER/modules/layers.py +28 -0
- GLiNER/modules/run_evaluation.py +188 -0
- GLiNER/modules/span_rep.py +369 -0
- GLiNER/requirements.txt +6 -0
- GLiNER/save_load.py +20 -0
- GLiNER/train.py +131 -0
GLiNER/README.md
ADDED
@@ -0,0 +1,90 @@
# Model Card for GLiNER-base

GLiNER is a Named Entity Recognition (NER) model capable of identifying any entity type using a bidirectional transformer encoder (BERT-like). It provides a practical alternative to traditional NER models, which are limited to predefined entities, and to Large Language Models (LLMs), which, despite their flexibility, are costly and large for resource-constrained scenarios.

## Models Status

### Available Models on Hugging Face

- [x] [GLiNER-Base](https://huggingface.co/urchade/gliner_base) (CC BY NC 4.0)
- [x] [GLiNER-Multi](https://huggingface.co/urchade/gliner_multi) (CC BY NC 4.0)
- [x] [GLiNER-small](https://huggingface.co/urchade/gliner_small) (CC BY NC 4.0)
- [x] [GLiNER-small-v2](https://huggingface.co/urchade/gliner_smallv2) (Apache)
- [x] [GLiNER-medium](https://huggingface.co/urchade/gliner_medium) (CC BY NC 4.0)
- [x] [GLiNER-medium-v2](https://huggingface.co/urchade/gliner_mediumv2) (Apache)
- [x] [GLiNER-large](https://huggingface.co/urchade/gliner_large) (CC BY NC 4.0)
- [x] [GLiNER-large-v2](https://huggingface.co/urchade/gliner_largev2) (Apache)

### To Release

- [ ] ⏳ GLiNER-Multiv2
- [ ] ⏳ GLiNER-Sup (trained on a mixture of NER datasets)

## Links

* Paper: https://arxiv.org/abs/2311.08526
* Repository: https://github.com/urchade/GLiNER

## Installation

To use this model, you must download the GLiNER repository and install its dependencies:

```
!git clone https://github.com/urchade/GLiNER.git
%cd GLiNER
!pip install -r requirements.txt
```

## Usage

Once you've downloaded the GLiNER repository, you can import the GLiNER class from the `model` file. You can then load the model with `GLiNER.from_pretrained` and predict entities with `predict_entities`.

```python
from model import GLiNER

model = GLiNER.from_pretrained("urchade/gliner_base")

text = """
Cristiano Ronaldo dos Santos Aveiro (Portuguese pronunciation: [kɾiʃˈtjɐnu ʁɔˈnaldu]; born 5 February 1985) is a Portuguese professional footballer who plays as a forward for and captains both Saudi Pro League club Al Nassr and the Portugal national team. Widely regarded as one of the greatest players of all time, Ronaldo has won five Ballon d'Or awards,[note 3] a record three UEFA Men's Player of the Year Awards, and four European Golden Shoes, the most by a European player. He has won 33 trophies in his career, including seven league titles, five UEFA Champions Leagues, the UEFA European Championship and the UEFA Nations League. Ronaldo holds the records for most appearances (183), goals (140) and assists (42) in the Champions League, goals in the European Championship (14), international goals (128) and international appearances (205). He is one of the few players to have made over 1,200 professional career appearances, the most by an outfield player, and has scored over 850 official senior career goals for club and country, making him the top goalscorer of all time.
"""

labels = ["person", "award", "date", "competitions", "teams"]

entities = model.predict_entities(text, labels)

for entity in entities:
    print(entity["text"], "=>", entity["label"])
```

```
Cristiano Ronaldo dos Santos Aveiro => person
5 February 1985 => date
Al Nassr => teams
Portugal national team => teams
Ballon d'Or => award
UEFA Men's Player of the Year Awards => award
European Golden Shoes => award
UEFA Champions Leagues => competitions
UEFA European Championship => competitions
UEFA Nations League => competitions
Champions League => competitions
European Championship => competitions
```

## Named Entity Recognition benchmark result

![image/png](https://cdn-uploads.huggingface.co/production/uploads/6317233cc92fd6fee317e030/Y5f7tK8lonGqeeO6L6bVI.png)

## Model Authors

The model authors are:
* [Urchade Zaratiana](https://huggingface.co/urchade)
* Nadi Tomeh
* Pierre Holat
* Thierry Charnois

## Citation

```bibtex
@misc{zaratiana2023gliner,
    title={GLiNER: Generalist Model for Named Entity Recognition using Bidirectional Transformer},
    author={Urchade Zaratiana and Nadi Tomeh and Pierre Holat and Thierry Charnois},
    year={2023},
    eprint={2311.08526},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}
```
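One detail worth noting from `model.py` in this commit: each entity dictionary returned by `predict_entities` also carries character offsets into the original string, not just the surface text and label, which is convenient for highlighting matches:

```python
# Keys per model.py in this commit: "start", "end", "text", "label".
for entity in entities:
    print(entity["start"], entity["end"], entity["text"], "=>", entity["label"])
```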
GLiNER/model.py
ADDED
@@ -0,0 +1,412 @@
import argparse
import json
from pathlib import Path
import re
from typing import Dict, Optional, Union

import torch
import torch.nn.functional as F
from modules.layers import LstmSeq2SeqEncoder
from modules.base import InstructBase
from modules.evaluator import Evaluator, greedy_search
from modules.span_rep import SpanRepLayer
from modules.token_rep import TokenRepLayer
from torch import nn
from torch.nn.utils.rnn import pad_sequence
from huggingface_hub import PyTorchModelHubMixin, hf_hub_download
from huggingface_hub.utils import HfHubHTTPError


class GLiNER(InstructBase, PyTorchModelHubMixin):
    def __init__(self, config):
        super().__init__(config)

        self.config = config

        # [ENT] token
        self.entity_token = "<<ENT>>"
        self.sep_token = "<<SEP>>"

        # usually a pretrained bidirectional transformer, returns first-subtoken representations
        self.token_rep_layer = TokenRepLayer(model_name=config.model_name, fine_tune=config.fine_tune,
                                             subtoken_pooling=config.subtoken_pooling, hidden_size=config.hidden_size,
                                             add_tokens=[self.entity_token, self.sep_token])

        # hierarchical representation of tokens
        self.rnn = LstmSeq2SeqEncoder(
            input_size=config.hidden_size,
            hidden_size=config.hidden_size // 2,
            num_layers=1,
            bidirectional=True,
        )

        # span representation
        self.span_rep_layer = SpanRepLayer(
            span_mode=config.span_mode,
            hidden_size=config.hidden_size,
            max_width=config.max_width,
            dropout=config.dropout,
        )

        # prompt representation (FFN)
        self.prompt_rep_layer = nn.Sequential(
            nn.Linear(config.hidden_size, config.hidden_size * 4),
            nn.Dropout(config.dropout),
            nn.ReLU(),
            nn.Linear(config.hidden_size * 4, config.hidden_size)
        )

    def compute_score_train(self, x):
        span_idx = x['span_idx'] * x['span_mask'].unsqueeze(-1)

        new_length = x['seq_length'].clone()
        new_tokens = []
        all_len_prompt = []
        num_classes_all = []

        # add prompt to the tokens
        for i in range(len(x['tokens'])):
            all_types_i = list(x['classes_to_id'][i].keys())
            # multiple entity types in all_types. The prompt is prepended to the tokens
            entity_prompt = []
            num_classes_all.append(len(all_types_i))
            # add entity types to the prompt
            for entity_type in all_types_i:
                entity_prompt.append(self.entity_token)  # [ENT] token
                entity_prompt.append(entity_type)  # entity type
            entity_prompt.append(self.sep_token)  # [SEP] token

            # prompt format:
            # [ENT] entity_type [ENT] entity_type ... [ENT] entity_type [SEP]

            # add prompt to the tokens
            tokens_p = entity_prompt + x['tokens'][i]

            # input format:
            # [ENT] entity_type_1 [ENT] entity_type_2 ... [ENT] entity_type_m [SEP] token_1 token_2 ... token_n

            # update length of the sequence (add prompt length to the original length)
            new_length[i] = new_length[i] + len(entity_prompt)
            # update tokens
            new_tokens.append(tokens_p)
            # store prompt length
            all_len_prompt.append(len(entity_prompt))

        # create a mask using num_classes_all (0 if it exceeds the number of classes, 1 otherwise)
        max_num_classes = max(num_classes_all)
        entity_type_mask = torch.arange(max_num_classes).unsqueeze(0).expand(len(num_classes_all), -1).to(
            x['span_mask'].device)
        entity_type_mask = entity_type_mask < torch.tensor(num_classes_all).unsqueeze(-1).to(
            x['span_mask'].device)  # [batch_size, max_num_classes]

        # compute all token representations
        bert_output = self.token_rep_layer(new_tokens, new_length)
        word_rep_w_prompt = bert_output["embeddings"]  # embeddings for all tokens (with prompt)
        mask_w_prompt = bert_output["mask"]  # mask for all tokens (with prompt)

        # get word representation (after [SEP]), mask (after [SEP]) and entity type representation (before [SEP])
        word_rep = []  # word representation (after [SEP])
        mask = []  # mask (after [SEP])
        entity_type_rep = []  # entity type representation (before [SEP])
        for i in range(len(x['tokens'])):
            prompt_entity_length = all_len_prompt[i]  # length of prompt for this example
            # get word representation (after [SEP])
            word_rep.append(word_rep_w_prompt[i, prompt_entity_length:prompt_entity_length + x['seq_length'][i]])
            # get mask (after [SEP])
            mask.append(mask_w_prompt[i, prompt_entity_length:prompt_entity_length + x['seq_length'][i]])

            # get entity type representation (before [SEP])
            entity_rep = word_rep_w_prompt[i, :prompt_entity_length - 1]  # remove [SEP]
            entity_rep = entity_rep[0::2]  # take every second element starting from the first one (the [ENT] positions)
            entity_type_rep.append(entity_rep)

        # padding for word_rep, mask and entity_type_rep
        word_rep = pad_sequence(word_rep, batch_first=True)  # [batch_size, seq_len, hidden_size]
        mask = pad_sequence(mask, batch_first=True)  # [batch_size, seq_len]
        entity_type_rep = pad_sequence(entity_type_rep, batch_first=True)  # [batch_size, len_types, hidden_size]

        # compute span representation
        word_rep = self.rnn(word_rep, mask)
        span_rep = self.span_rep_layer(word_rep, span_idx)

        # compute final entity type representation (FFN)
        entity_type_rep = self.prompt_rep_layer(entity_type_rep)  # (batch_size, len_types, hidden_size)
        num_classes = entity_type_rep.shape[1]  # number of entity types

        # similarity score
        scores = torch.einsum('BLKD,BCD->BLKC', span_rep, entity_type_rep)

        return scores, num_classes, entity_type_mask

    def forward(self, x):
        # compute span representation
        scores, num_classes, entity_type_mask = self.compute_score_train(x)
        batch_size = scores.shape[0]

        # loss for filtering classifier
        logits_label = scores.view(-1, num_classes)
        labels = x["span_label"].view(-1)  # (batch_size * num_spans)
        mask_label = labels != -1  # (batch_size * num_spans)
        labels.masked_fill_(~mask_label, 0)  # set the labels of padding tokens to 0

        # one-hot encoding
        labels_one_hot = torch.zeros(labels.size(0), num_classes + 1, dtype=torch.float32).to(scores.device)
        labels_one_hot.scatter_(1, labels.unsqueeze(1), 1)  # set the corresponding index to 1
        labels_one_hot = labels_one_hot[:, 1:]  # remove the first column (the null label)
        # shape of labels_one_hot: (batch_size * num_spans, num_classes)

        # compute loss (without reduction)
        all_losses = F.binary_cross_entropy_with_logits(logits_label, labels_one_hot,
                                                        reduction='none')
        # mask loss using entity_type_mask (B, C)
        masked_loss = all_losses.view(batch_size, -1, num_classes) * entity_type_mask.unsqueeze(1)
        all_losses = masked_loss.view(-1, num_classes)
        # expand mask_label to all_losses
        mask_label = mask_label.unsqueeze(-1).expand_as(all_losses)
        # weight positive labels higher than negatives (2 for positive, 1 for negative)
        weight_c = labels_one_hot + 1
        # apply mask
        all_losses = all_losses * mask_label.float() * weight_c
        return all_losses.sum()

    def compute_score_eval(self, x, device):
        # check if classes_to_id is a dict
        assert isinstance(x['classes_to_id'], dict), "classes_to_id must be a dict"

        span_idx = (x['span_idx'] * x['span_mask'].unsqueeze(-1)).to(device)

        all_types = list(x['classes_to_id'].keys())
        # multiple entity types in all_types. The prompt is prepended to the tokens
        entity_prompt = []

        # add entity types to the prompt
        for entity_type in all_types:
            entity_prompt.append(self.entity_token)
            entity_prompt.append(entity_type)

        entity_prompt.append(self.sep_token)

        prompt_entity_length = len(entity_prompt)

        # add prompt
        tokens_p = [entity_prompt + tokens for tokens in x['tokens']]
        seq_length_p = x['seq_length'] + prompt_entity_length

        out = self.token_rep_layer(tokens_p, seq_length_p)

        word_rep_w_prompt = out["embeddings"]
        mask_w_prompt = out["mask"]

        # remove prompt
        word_rep = word_rep_w_prompt[:, prompt_entity_length:, :]
        mask = mask_w_prompt[:, prompt_entity_length:]

        # get entity type representation
        entity_type_rep = word_rep_w_prompt[:, :prompt_entity_length - 1, :]
        # extract [ENT] tokens (which sit at even positions in entity_type_rep)
        entity_type_rep = entity_type_rep[:, 0::2, :]

        entity_type_rep = self.prompt_rep_layer(entity_type_rep)  # (batch_size, len_types, hidden_size)

        word_rep = self.rnn(word_rep, mask)

        span_rep = self.span_rep_layer(word_rep, span_idx)

        local_scores = torch.einsum('BLKD,BCD->BLKC', span_rep, entity_type_rep)

        return local_scores

    @torch.no_grad()
    def predict(self, x, flat_ner=False, threshold=0.5):
        self.eval()
        local_scores = self.compute_score_eval(x, device=next(self.parameters()).device)
        spans = []
        for i, _ in enumerate(x["tokens"]):
            local_i = local_scores[i]
            wh_i = [i.tolist() for i in torch.where(torch.sigmoid(local_i) > threshold)]
            span_i = []
            for s, k, c in zip(*wh_i):
                if s + k < len(x["tokens"][i]):
                    span_i.append((s, s + k, x["id_to_classes"][c + 1], local_i[s, k, c]))
            span_i = greedy_search(span_i, flat_ner)
            spans.append(span_i)
        return spans

    def predict_entities(self, text, labels, flat_ner=True, threshold=0.5):
        tokens = []
        start_token_idx_to_text_idx = []
        end_token_idx_to_text_idx = []
        for match in re.finditer(r'\w+(?:[-_]\w+)*|\S', text):
            tokens.append(match.group())
            start_token_idx_to_text_idx.append(match.start())
            end_token_idx_to_text_idx.append(match.end())

        input_x = {"tokenized_text": tokens, "ner": None}
        x = self.collate_fn([input_x], labels)
        output = self.predict(x, flat_ner=flat_ner, threshold=threshold)

        entities = []
        for start_token_idx, end_token_idx, ent_type in output[0]:
            start_text_idx = start_token_idx_to_text_idx[start_token_idx]
            end_text_idx = end_token_idx_to_text_idx[end_token_idx]
            entities.append({
                "start": start_text_idx,
                "end": end_text_idx,
                "text": text[start_text_idx:end_text_idx],
                "label": ent_type,
            })
        return entities

    def evaluate(self, test_data, flat_ner=False, threshold=0.5, batch_size=12, entity_types=None):
        self.eval()
        data_loader = self.create_dataloader(test_data, batch_size=batch_size, entity_types=entity_types, shuffle=False)
        device = next(self.parameters()).device
        all_preds = []
        all_trues = []
        for x in data_loader:
            for k, v in x.items():
                if isinstance(v, torch.Tensor):
                    x[k] = v.to(device)
            batch_predictions = self.predict(x, flat_ner, threshold)
            all_preds.extend(batch_predictions)
            all_trues.extend(x["entities"])
        evaluator = Evaluator(all_trues, all_preds)
        out, f1 = evaluator.evaluate()
        return out, f1

    @classmethod
    def _from_pretrained(
        cls,
        *,
        model_id: str,
        revision: Optional[str],
        cache_dir: Optional[Union[str, Path]],
        force_download: bool,
        proxies: Optional[Dict],
        resume_download: bool,
        local_files_only: bool,
        token: Union[str, bool, None],
        map_location: str = "cpu",
        strict: bool = False,
        **model_kwargs,
    ):
        # 1. Backwards compatibility: use "gliner_base.pt" and "gliner_multi.pt" with all data
        filenames = ["gliner_base.pt", "gliner_multi.pt"]
        for filename in filenames:
            model_file = Path(model_id) / filename
            if not model_file.exists():
                try:
                    model_file = hf_hub_download(
                        repo_id=model_id,
                        filename=filename,
                        revision=revision,
                        cache_dir=cache_dir,
                        force_download=force_download,
                        proxies=proxies,
                        resume_download=resume_download,
                        token=token,
                        local_files_only=local_files_only,
                    )
                except HfHubHTTPError:
                    continue
            dict_load = torch.load(model_file, map_location=torch.device(map_location))
            config = dict_load["config"]
            state_dict = dict_load["model_weights"]
            config.model_name = "microsoft/deberta-v3-base" if filename == "gliner_base.pt" else "microsoft/mdeberta-v3-base"
            model = cls(config)
            model.load_state_dict(state_dict, strict=strict, assign=True)
            # Required to update flair's internals as well:
            model.to(map_location)
            return model

        # 2. Newer format: use "pytorch_model.bin" and "gliner_config.json"
        from train import load_config_as_namespace

        model_file = Path(model_id) / "pytorch_model.bin"
        if not model_file.exists():
            model_file = hf_hub_download(
                repo_id=model_id,
                filename="pytorch_model.bin",
                revision=revision,
                cache_dir=cache_dir,
                force_download=force_download,
                proxies=proxies,
                resume_download=resume_download,
                token=token,
                local_files_only=local_files_only,
            )
        config_file = Path(model_id) / "gliner_config.json"
        if not config_file.exists():
            config_file = hf_hub_download(
                repo_id=model_id,
                filename="gliner_config.json",
                revision=revision,
                cache_dir=cache_dir,
                force_download=force_download,
                proxies=proxies,
                resume_download=resume_download,
                token=token,
                local_files_only=local_files_only,
            )
        config = load_config_as_namespace(config_file)
        model = cls(config)
        state_dict = torch.load(model_file, map_location=torch.device(map_location))
        model.load_state_dict(state_dict, strict=strict, assign=True)
        model.to(map_location)
        return model

    def save_pretrained(
        self,
        save_directory: Union[str, Path],
        *,
        config: Optional[Union[dict, "DataclassInstance"]] = None,
        repo_id: Optional[str] = None,
        push_to_hub: bool = False,
        **push_to_hub_kwargs,
    ) -> Optional[str]:
        """
        Save weights in a local directory.

        Args:
            save_directory (`str` or `Path`):
                Path to directory in which the model weights and configuration will be saved.
            config (`dict` or `DataclassInstance`, *optional*):
                Model configuration specified as a key/value dictionary or a dataclass instance.
            push_to_hub (`bool`, *optional*, defaults to `False`):
                Whether or not to push your model to the Hugging Face Hub after saving it.
            repo_id (`str`, *optional*):
                ID of your repository on the Hub. Used only if `push_to_hub=True`. Will default to the folder name
                if not provided.
            kwargs:
                Additional keyword arguments passed along to the [`~ModelHubMixin.push_to_hub`] method.
        """
        save_directory = Path(save_directory)
        save_directory.mkdir(parents=True, exist_ok=True)

        # save model weights/files
        torch.save(self.state_dict(), save_directory / "pytorch_model.bin")

        # save config (if provided)
        if config is None:
            config = self.config
        if config is not None:
            if isinstance(config, argparse.Namespace):
                config = vars(config)
            (save_directory / "gliner_config.json").write_text(json.dumps(config, indent=2))

        # push to the Hub if required
        if push_to_hub:
            kwargs = push_to_hub_kwargs.copy()  # soft-copy to avoid mutating input
            if config is not None:  # kwarg for `push_to_hub`
                kwargs["config"] = config
            if repo_id is None:
                repo_id = save_directory.name  # defaults to the `save_directory` name
            return self.push_to_hub(repo_id=repo_id, **kwargs)
        return None

    def to(self, device):
        super().to(device)
        import flair

        flair.device = device
        return self
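The core scoring step in `compute_score_train` and `compute_score_eval` is the einsum `'BLKD,BCD->BLKC'`: every candidate span vector is dotted against every prompted entity-type vector. A minimal shape sketch (toy sizes, not the model's defaults):

```python
import torch

B, L, K, D, C = 2, 10, 12, 8, 3           # batch, seq len, max span width, hidden, num types
span_rep = torch.randn(B, L, K, D)        # one vector per (start, width) span candidate
entity_type_rep = torch.randn(B, C, D)    # one vector per prompted entity type

# Dot product of every span vector with every type vector:
scores = torch.einsum('BLKD,BCD->BLKC', span_rep, entity_type_rep)
assert scores.shape == (B, L, K, C)       # a logit per (span, type) pair
```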
GLiNER/modules/base.py
ADDED
@@ -0,0 +1,150 @@
from collections import defaultdict
from typing import List, Tuple, Dict

import torch
from torch import nn
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader
import random


class InstructBase(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.max_width = config.max_width
        self.base_config = config

    def get_dict(self, spans, classes_to_id):
        dict_tag = defaultdict(int)
        for span in spans:
            if span[2] in classes_to_id:
                dict_tag[(span[0], span[1])] = classes_to_id[span[2]]
        return dict_tag

    def preprocess_spans(self, tokens, ner, classes_to_id):
        max_len = self.base_config.max_len

        if len(tokens) > max_len:
            length = max_len
            tokens = tokens[:max_len]
        else:
            length = len(tokens)

        spans_idx = []
        for i in range(length):
            spans_idx.extend([(i, i + j) for j in range(self.max_width)])

        dict_lab = self.get_dict(ner, classes_to_id) if ner else defaultdict(int)

        # 0 for null labels
        span_label = torch.LongTensor([dict_lab[i] for i in spans_idx])
        spans_idx = torch.LongTensor(spans_idx)

        # mask spans whose end index falls outside the sentence
        valid_span_mask = spans_idx[:, 1] > length - 1

        # mask invalid positions
        span_label = span_label.masked_fill(valid_span_mask, -1)

        return {
            'tokens': tokens,
            'span_idx': spans_idx,
            'span_label': span_label,
            'seq_length': length,
            'entities': ner,
        }

    def collate_fn(self, batch_list, entity_types=None):
        # batch_list: list of dicts containing tokens and ner
        if entity_types is None:
            negs = self.get_negatives(batch_list, 100)
            class_to_ids = []
            id_to_classes = []
            for b in batch_list:
                # negs = b["negative"]
                random.shuffle(negs)

                # negs = negs[:sampled_neg]
                max_neg_type_ratio = int(self.base_config.max_neg_type_ratio)

                if max_neg_type_ratio == 0:
                    # no negatives
                    neg_type_ratio = 0
                else:
                    neg_type_ratio = random.randint(0, max_neg_type_ratio)

                if neg_type_ratio == 0:
                    # no negatives
                    negs_i = []
                else:
                    negs_i = negs[:len(b['ner']) * neg_type_ratio]

                # this is the list of all possible entity types (positive and negative)
                types = list(set([el[-1] for el in b['ner']] + negs_i))

                # shuffle (every epoch)
                random.shuffle(types)

                if len(types) != 0:
                    # random drop
                    if self.base_config.random_drop:
                        num_ents = random.randint(1, len(types))
                        types = types[:num_ents]

                # maximum number of entity types
                types = types[:int(self.base_config.max_types)]

                # supervised training
                if "label" in b:
                    types = sorted(b["label"])

                class_to_id = {k: v for v, k in enumerate(types, start=1)}
                id_to_class = {v: k for k, v in class_to_id.items()}
                class_to_ids.append(class_to_id)
                id_to_classes.append(id_to_class)

            batch = [
                self.preprocess_spans(b["tokenized_text"], b["ner"], class_to_ids[i]) for i, b in enumerate(batch_list)
            ]

        else:
            class_to_ids = {k: v for v, k in enumerate(entity_types, start=1)}
            id_to_classes = {v: k for k, v in class_to_ids.items()}
            batch = [
                self.preprocess_spans(b["tokenized_text"], b["ner"], class_to_ids) for b in batch_list
            ]

        span_idx = pad_sequence(
            [b['span_idx'] for b in batch], batch_first=True, padding_value=0
        )

        span_label = pad_sequence(
            [el['span_label'] for el in batch], batch_first=True, padding_value=-1
        )

        return {
            'seq_length': torch.LongTensor([el['seq_length'] for el in batch]),
            'span_idx': span_idx,
            'tokens': [el['tokens'] for el in batch],
            'span_mask': span_label != -1,
            'span_label': span_label,
            'entities': [el['entities'] for el in batch],
            'classes_to_id': class_to_ids,
            'id_to_classes': id_to_classes,
        }

    @staticmethod
    def get_negatives(batch_list, sampled_neg=5):
        ent_types = []
        for b in batch_list:
            types = set([el[-1] for el in b['ner']])
            ent_types.extend(list(types))
        ent_types = list(set(ent_types))
        # sample negatives
        random.shuffle(ent_types)
        return ent_types[:sampled_neg]

    def create_dataloader(self, data, entity_types=None, **kwargs):
        return DataLoader(data, collate_fn=lambda x: self.collate_fn(x, entity_types), **kwargs)
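`preprocess_spans` enumerates every (start, start + width) pair up to `max_width`, then assigns label `-1` to spans that run past the sentence so they are masked out of the loss. A toy rerun of that enumeration (assumed numbers, not the repo's defaults):

```python
length, max_width = 4, 3  # 4-token sentence, spans up to width 3
spans_idx = [(i, i + j) for i in range(length) for j in range(max_width)]
# Spans whose end index falls outside the sentence get span_label -1 above:
valid = [(s, e) for (s, e) in spans_idx if e <= length - 1]
print(valid)
# [(0, 0), (0, 1), (0, 2), (1, 1), (1, 2), (1, 3), (2, 2), (2, 3), (3, 3)]
```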
GLiNER/modules/data_proc.py
ADDED
@@ -0,0 +1,73 @@
import json
from tqdm import tqdm
# ast.literal_eval
import ast, re

path = 'train.json'

with open(path, 'r') as f:
    data = json.load(f)


def tokenize_text(text):
    return re.findall(r'\w+(?:[-_]\w+)*|\S', text)


def extract_entity_spans(entry):
    text = ""
    len_start = len("What describes ")
    len_end = len(" in the text?")
    entity_types = []
    entity_texts = []

    for c in entry['conversations']:
        if c['from'] == 'human' and c['value'].startswith('Text: '):
            text = c['value'][len('Text: '):]
            tokenized_text = tokenize_text(text)

        if c['from'] == 'human' and c['value'].startswith('What describes '):
            c_type = c['value'][len_start:-len_end]
            c_type = c_type.replace(' ', '_')
            entity_types.append(c_type)

        elif c['from'] == 'gpt' and c['value'].startswith('['):
            if c['value'] == '[]':
                entity_types = entity_types[:-1]
                continue

            texts_ents = ast.literal_eval(c['value'])
            entity_texts.extend(texts_ents)
            num_repeat = len(texts_ents) - 1
            entity_types.extend([entity_types[-1]] * num_repeat)

    entity_spans = []
    for j, entity_text in enumerate(entity_texts):
        entity_tokens = tokenize_text(entity_text)
        matches = []
        for i in range(len(tokenized_text) - len(entity_tokens) + 1):
            if " ".join(tokenized_text[i:i + len(entity_tokens)]).lower() == " ".join(entity_tokens).lower():
                matches.append((i, i + len(entity_tokens) - 1, entity_types[j]))
        if matches:
            entity_spans.extend(matches)

    return entity_spans, tokenized_text


# Usage:
# replace 'entry' with a specific entry from your JSON data
entry = data[17818]  # for example, taking one entry
entity_spans, tokenized_text = extract_entity_spans(entry)
print("Entity Spans:", entity_spans)
# print("Tokenized Text:", tokenized_text)

# create a dict: {"tokenized_text": tokenized_text, "ner": entity_spans}

all_data = []

for entry in tqdm(data):
    entity_spans, tokenized_text = extract_entity_spans(entry)
    all_data.append({"tokenized_text": tokenized_text, "ner": entity_spans})


with open('train_instruct.json', 'w') as f:
    json.dump(all_data, f)
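For reference, each record written to `train_instruct.json` follows the `{"tokenized_text", "ner"}` layout that `InstructBase.collate_fn` consumes. A hypothetical example (not taken from the dataset):

```python
record = {
    "tokenized_text": ["Barack", "Obama", "visited", "Paris", "."],
    # (start_token, end_token, type), inclusive on both ends
    "ner": [(0, 1, "person"), (3, 3, "location")],
}
```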
GLiNER/modules/evaluator.py
ADDED
@@ -0,0 +1,152 @@
from collections import defaultdict

import numpy as np
import torch
from seqeval.metrics.v1 import _prf_divide


def extract_tp_actual_correct(y_true, y_pred):
    entities_true = defaultdict(set)
    entities_pred = defaultdict(set)

    for type_name, (start, end), idx in y_true:
        entities_true[type_name].add((start, end, idx))
    for type_name, (start, end), idx in y_pred:
        entities_pred[type_name].add((start, end, idx))

    target_names = sorted(set(entities_true.keys()) | set(entities_pred.keys()))

    tp_sum = np.array([], dtype=np.int32)
    pred_sum = np.array([], dtype=np.int32)
    true_sum = np.array([], dtype=np.int32)
    for type_name in target_names:
        entities_true_type = entities_true.get(type_name, set())
        entities_pred_type = entities_pred.get(type_name, set())
        tp_sum = np.append(tp_sum, len(entities_true_type & entities_pred_type))
        pred_sum = np.append(pred_sum, len(entities_pred_type))
        true_sum = np.append(true_sum, len(entities_true_type))

    return pred_sum, tp_sum, true_sum, target_names


def flatten_for_eval(y_true, y_pred):
    all_true = []
    all_pred = []

    for i, (true, pred) in enumerate(zip(y_true, y_pred)):
        all_true.extend([t + [i] for t in true])
        all_pred.extend([p + [i] for p in pred])

    return all_true, all_pred


def compute_prf(y_true, y_pred, average='micro'):
    y_true, y_pred = flatten_for_eval(y_true, y_pred)

    pred_sum, tp_sum, true_sum, target_names = extract_tp_actual_correct(y_true, y_pred)

    if average == 'micro':
        tp_sum = np.array([tp_sum.sum()])
        pred_sum = np.array([pred_sum.sum()])
        true_sum = np.array([true_sum.sum()])

    precision = _prf_divide(
        numerator=tp_sum,
        denominator=pred_sum,
        metric='precision',
        modifier='predicted',
        average=average,
        warn_for=('precision', 'recall', 'f-score'),
        zero_division='warn'
    )

    recall = _prf_divide(
        numerator=tp_sum,
        denominator=true_sum,
        metric='recall',
        modifier='true',
        average=average,
        warn_for=('precision', 'recall', 'f-score'),
        zero_division='warn'
    )

    denominator = precision + recall
    denominator[denominator == 0.] = 1
    f_score = 2 * (precision * recall) / denominator

    return {'precision': precision[0], 'recall': recall[0], 'f_score': f_score[0]}


class Evaluator:
    def __init__(self, all_true, all_outs):
        self.all_true = all_true
        self.all_outs = all_outs

    def get_entities_fr(self, ents):
        all_ents = []
        for s, e, lab in ents:
            all_ents.append([lab, (s, e)])
        return all_ents

    def transform_data(self):
        all_true_ent = []
        all_outs_ent = []
        for i, j in zip(self.all_true, self.all_outs):
            e = self.get_entities_fr(i)
            all_true_ent.append(e)
            e = self.get_entities_fr(j)
            all_outs_ent.append(e)
        return all_true_ent, all_outs_ent

    @torch.no_grad()
    def evaluate(self):
        all_true_typed, all_outs_typed = self.transform_data()
        precision, recall, f1 = compute_prf(all_true_typed, all_outs_typed).values()
        output_str = f"P: {precision:.2%}\tR: {recall:.2%}\tF1: {f1:.2%}\n"
        return output_str, f1


def is_nested(idx1, idx2):
    # Return True if idx2 is nested inside idx1 or vice versa
    return (idx1[0] <= idx2[0] and idx1[1] >= idx2[1]) or (idx2[0] <= idx1[0] and idx2[1] >= idx1[1])


def has_overlapping(idx1, idx2):
    overlapping = True
    if idx1[:2] == idx2[:2]:
        return overlapping
    if idx1[0] > idx2[1] or idx2[0] > idx1[1]:
        overlapping = False
    return overlapping


def has_overlapping_nested(idx1, idx2):
    # Return True if idx1 and idx2 overlap, but neither is nested inside the other
    if idx1[:2] == idx2[:2]:
        return True
    if ((idx1[0] > idx2[1] or idx2[0] > idx1[1]) or is_nested(idx1, idx2)) and idx1 != idx2:
        return False
    else:
        return True


def greedy_search(spans, flat_ner=True):  # start, end, class, score
    if flat_ner:
        has_ov = has_overlapping
    else:
        has_ov = has_overlapping_nested

    new_list = []
    span_prob = sorted(spans, key=lambda x: -x[-1])
    for i in range(len(spans)):
        b = span_prob[i]
        flag = False
        for new in new_list:
            if has_ov(b[:-1], new):
                flag = True
                break
        if not flag:
            new_list.append(b[:-1])
    new_list = sorted(new_list, key=lambda x: x[0])
    return new_list
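A toy run of `greedy_search` above with flat NER (assumed spans and scores): candidates are considered highest-score first, and any later span that conflicts with an already-kept one is dropped.

```python
# Spans are (start, end, label, score); higher-scoring spans win conflicts.
spans = [
    (0, 2, "person", 0.9),
    (1, 3, "organization", 0.8),   # overlaps the first span -> dropped
    (5, 6, "location", 0.7),
]
print(greedy_search(spans, flat_ner=True))
# [(0, 2, 'person'), (5, 6, 'location')]
```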
GLiNER/modules/layers.py
ADDED
@@ -0,0 +1,28 @@
import torch
import torch.nn.functional as F
from torch import nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence


class LstmSeq2SeqEncoder(nn.Module):
    def __init__(self, input_size, hidden_size, num_layers=1, dropout=0., bidirectional=False):
        super(LstmSeq2SeqEncoder, self).__init__()
        self.lstm = nn.LSTM(input_size=input_size,
                            hidden_size=hidden_size,
                            num_layers=num_layers,
                            dropout=dropout,
                            bidirectional=bidirectional,
                            batch_first=True)

    def forward(self, x, mask, hidden=None):
        # Packing the input sequence
        lengths = mask.sum(dim=1).cpu()
        packed_x = pack_padded_sequence(x, lengths, batch_first=True, enforce_sorted=False)

        # Passing the packed sequence through the LSTM
        packed_output, hidden = self.lstm(packed_x, hidden)

        # Unpacking the output sequence
        output, _ = pad_packed_sequence(packed_output, batch_first=True)

        return output
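A minimal shape check for the encoder above (toy sizes): with `bidirectional=True` the output width is `2 * hidden_size`, which is why `GLiNER.__init__` constructs it with `hidden_size=config.hidden_size // 2`.

```python
import torch

enc = LstmSeq2SeqEncoder(input_size=16, hidden_size=8, bidirectional=True)
x = torch.randn(2, 5, 16)                      # [batch, seq, features]
mask = torch.tensor([[1, 1, 1, 0, 0],
                     [1, 1, 1, 1, 1]]).bool()  # per-token validity
out = enc(x, mask)
print(out.shape)  # torch.Size([2, 5, 16]) -> 2 directions * hidden_size 8
```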
GLiNER/modules/run_evaluation.py
ADDED
@@ -0,0 +1,188 @@
import glob
import json
import os

import torch
from tqdm import tqdm
import random


def open_content(path):
    paths = glob.glob(os.path.join(path, "*.json"))
    train, dev, test, labels = None, None, None, None
    for p in paths:
        if "train" in p:
            with open(p, "r") as f:
                train = json.load(f)
        elif "dev" in p:
            with open(p, "r") as f:
                dev = json.load(f)
        elif "test" in p:
            with open(p, "r") as f:
                test = json.load(f)
        elif "labels" in p:
            with open(p, "r") as f:
                labels = json.load(f)
    return train, dev, test, labels


def process(data):
    words = data['sentence'].split()
    entities = []  # list of entities (start, end, type)

    for entity in data['entities']:
        start_char, end_char = entity['pos']

        # initialize variables to keep track of word positions
        start_word = None
        end_word = None

        # iterate through words and find the word positions
        char_count = 0
        for i, word in enumerate(words):
            word_length = len(word)
            if char_count == start_char:
                start_word = i
            if char_count + word_length == end_char:
                end_word = i
                break
            char_count += word_length + 1  # add 1 for the space

        # append the word positions to the list
        entities.append((start_word, end_word, entity['type']))

    # create a record of the tokens and the word-level entity spans
    sample = {
        "tokenized_text": words,
        "ner": entities
    }

    return sample


# create dataset
def create_dataset(path):
    train, dev, test, labels = open_content(path)
    train_dataset = []
    dev_dataset = []
    test_dataset = []
    for data in train:
        train_dataset.append(process(data))
    for data in dev:
        dev_dataset.append(process(data))
    for data in test:
        test_dataset.append(process(data))
    return train_dataset, dev_dataset, test_dataset, labels


@torch.no_grad()
def get_for_one_path(path, model):
    # load the dataset
    _, _, test_dataset, entity_types = create_dataset(path)

    data_name = path.split("/")[-1]  # get the name of the dataset

    # use nested NER for datasets that contain nested entities
    flat_ner = True
    if any([i in data_name for i in ["ACE", "GENIA", "Corpus"]]):
        flat_ner = False

    # evaluate the model
    results, f1 = model.evaluate(test_dataset, flat_ner=flat_ner, threshold=0.5, batch_size=12,
                                 entity_types=entity_types)
    return data_name, results, f1


def get_for_all_path(model, steps, log_dir, data_paths):
    all_paths = glob.glob(f"{data_paths}/*")

    all_paths = sorted(all_paths)

    # move the model to the device
    device = next(model.parameters()).device
    model.to(device)
    # set the model to eval mode
    model.eval()

    # log the results
    save_path = os.path.join(log_dir, "results.txt")

    with open(save_path, "a") as f:
        f.write("##############################################\n")
        # write step
        f.write("step: " + str(steps) + "\n")

    zero_shot_benc = ["mit-movie", "mit-restaurant", "CrossNER_AI", "CrossNER_literature", "CrossNER_music",
                      "CrossNER_politics", "CrossNER_science"]

    zero_shot_benc_results = {}
    all_results = {}  # without CrossNER

    for p in tqdm(all_paths):
        if "sample_" not in p:
            data_name, results, f1 = get_for_one_path(p, model)
            # write to file
            with open(save_path, "a") as f:
                f.write(data_name + "\n")
                f.write(str(results) + "\n")

            if data_name in zero_shot_benc:
                zero_shot_benc_results[data_name] = f1
            else:
                all_results[data_name] = f1

    avg_all = sum(all_results.values()) / len(all_results)
    avg_zs = sum(zero_shot_benc_results.values()) / len(zero_shot_benc_results)

    save_path_table = os.path.join(log_dir, "tables.txt")

    # results for all datasets except CrossNER
    table_bench_all = ""
    for k, v in all_results.items():
        table_bench_all += f"{k:20}: {v:.1%}\n"
    # (names padded to 20 characters, as for the average below)
    table_bench_all += f"{'Average':20}: {avg_all:.1%}"

    # results for the zero-shot benchmark
    table_bench_zeroshot = ""
    for k, v in zero_shot_benc_results.items():
        table_bench_zeroshot += f"{k:20}: {v:.1%}\n"
    table_bench_zeroshot += f"{'Average':20}: {avg_zs:.1%}"

    # write to file
    with open(save_path_table, "a") as f:
        f.write("##############################################\n")
        f.write("step: " + str(steps) + "\n")
        f.write("Table for all datasets except CrossNER\n")
        f.write(table_bench_all + "\n\n")
        f.write("Table for zero-shot benchmark\n")
        f.write(table_bench_zeroshot + "\n")
        f.write("##############################################\n\n")


def sample_train_data(data_paths, sample_size=10000):
    all_paths = glob.glob(f"{data_paths}/*")

    all_paths = sorted(all_paths)

    # exclude the zero-shot benchmark datasets
    zero_shot_benc = ["CrossNER_AI", "CrossNER_literature", "CrossNER_music",
                      "CrossNER_politics", "CrossNER_science", "ACE 2004"]

    new_train = []
    # take `sample_size` samples from each dataset
    for p in tqdm(all_paths):
        if any([i in p for i in zero_shot_benc]):
            continue
        train, dev, test, labels = create_dataset(p)

        # add a label key to the train data
        for i in range(len(train)):
            train[i]["label"] = labels

        random.shuffle(train)
        train = train[:sample_size]
        new_train.extend(train)

    return new_train
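`process` converts character offsets (`pos`) into inclusive word indices by walking the whitespace-split tokens. A toy run, using a hypothetical record in the layout the function expects:

```python
data = {
    "sentence": "John lives in New York",
    "entities": [{"pos": [0, 4], "type": "person"},      # "John"
                 {"pos": [14, 22], "type": "location"}], # "New York"
}
print(process(data))
# {'tokenized_text': ['John', 'lives', 'in', 'New', 'York'],
#  'ner': [(0, 0, 'person'), (3, 4, 'location')]}
```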
GLiNER/modules/span_rep.py
ADDED
@@ -0,0 +1,369 @@
1 |
+
import torch
|
2 |
+
import torch.nn.functional as F
|
3 |
+
from torch import nn
|
4 |
+
|
5 |
+
def create_projection_layer(hidden_size: int, dropout: float, out_dim: int = None) -> nn.Sequential:
|
6 |
+
"""
|
7 |
+
Creates a projection layer with specified configurations.
|
8 |
+
"""
|
9 |
+
if out_dim is None:
|
10 |
+
out_dim = hidden_size
|
11 |
+
|
12 |
+
return nn.Sequential(
|
13 |
+
nn.Linear(hidden_size, out_dim * 4),
|
14 |
+
nn.ReLU(),
|
15 |
+
nn.Dropout(dropout),
|
16 |
+
nn.Linear(out_dim * 4, out_dim)
|
17 |
+
)
|
18 |
+
|
19 |
+
|
20 |
+
class SpanQuery(nn.Module):
|
21 |
+
|
22 |
+
def __init__(self, hidden_size, max_width, trainable=True):
|
23 |
+
super().__init__()
|
24 |
+
|
25 |
+
self.query_seg = nn.Parameter(torch.randn(hidden_size, max_width))
|
26 |
+
|
27 |
+
nn.init.uniform_(self.query_seg, a=-1, b=1)
|
28 |
+
|
29 |
+
if not trainable:
|
30 |
+
self.query_seg.requires_grad = False
|
31 |
+
|
32 |
+
self.project = nn.Sequential(
|
33 |
+
nn.Linear(hidden_size, hidden_size),
|
34 |
+
nn.ReLU()
|
35 |
+
)
|
36 |
+
|
37 |
+
def forward(self, h, *args):
|
38 |
+
# h of shape [B, L, D]
|
39 |
+
# query_seg of shape [D, max_width]
|
40 |
+
|
41 |
+
span_rep = torch.einsum('bld, ds->blsd', h, self.query_seg)
|
42 |
+
|
43 |
+
return self.project(span_rep)
|
44 |
+
|
45 |
+
|
46 |
+
class SpanMLP(nn.Module):
|
47 |
+
|
48 |
+
def __init__(self, hidden_size, max_width):
|
49 |
+
super().__init__()
|
50 |
+
|
51 |
+
self.mlp = nn.Linear(hidden_size, hidden_size * max_width)
|
52 |
+
|
53 |
+
def forward(self, h, *args):
|
54 |
+
# h of shape [B, L, D]
|
55 |
+
# query_seg of shape [D, max_width]
|
56 |
+
|
57 |
+
B, L, D = h.size()
|
58 |
+
|
59 |
+
span_rep = self.mlp(h)
|
60 |
+
|
61 |
+
span_rep = span_rep.view(B, L, -1, D)
|
62 |
+
|
63 |
+
return span_rep.relu()
|
64 |
+
|
65 |
+
|
66 |
+
class SpanCAT(nn.Module):
|
67 |
+
|
68 |
+
def __init__(self, hidden_size, max_width):
|
69 |
+
super().__init__()
|
70 |
+
|
71 |
+
self.max_width = max_width
|
72 |
+
|
73 |
+
self.query_seg = nn.Parameter(torch.randn(128, max_width))
|
74 |
+
|
75 |
+
self.project = nn.Sequential(
|
76 |
+
nn.Linear(hidden_size + 128, hidden_size),
|
77 |
+
nn.ReLU()
|
78 |
+
)
|
79 |
+
|
80 |
+
def forward(self, h, *args):
|
81 |
+
# h of shape [B, L, D]
|
82 |
+
# query_seg of shape [D, max_width]
|
83 |
+
|
84 |
+
B, L, D = h.size()
|
85 |
+
|
86 |
+
h = h.view(B, L, 1, D).repeat(1, 1, self.max_width, 1)
|
87 |
+
|
88 |
+
q = self.query_seg.view(1, 1, self.max_width, -1).repeat(B, L, 1, 1)
|
89 |
+
|
90 |
+
span_rep = torch.cat([h, q], dim=-1)
|
91 |
+
|
92 |
+
span_rep = self.project(span_rep)
|
93 |
+
|
94 |
+
return span_rep
|
95 |
+
|
96 |
+
|
97 |
+
class SpanConvBlock(nn.Module):
|
98 |
+
def __init__(self, hidden_size, kernel_size, span_mode='conv_normal'):
|
99 |
+
super().__init__()
|
100 |
+
|
101 |
+
if span_mode == 'conv_conv':
|
102 |
+
self.conv = nn.Conv1d(hidden_size, hidden_size,
|
103 |
+
kernel_size=kernel_size)
|
104 |
+
|
105 |
+
# initialize the weights
|
106 |
+
nn.init.kaiming_uniform_(self.conv.weight, nonlinearity='relu')
|
107 |
+
|
108 |
+
elif span_mode == 'conv_max':
|
109 |
+
self.conv = nn.MaxPool1d(kernel_size=kernel_size, stride=1)
|
110 |
+
elif span_mode == 'conv_mean' or span_mode == 'conv_sum':
|
111 |
+
self.conv = nn.AvgPool1d(kernel_size=kernel_size, stride=1)
|
112 |
+
|
113 |
+
self.span_mode = span_mode
|
114 |
+
|
115 |
+
self.pad = kernel_size - 1
|
116 |
+
|
117 |
+
def forward(self, x):
|
118 |
+
|
119 |
+
x = torch.einsum('bld->bdl', x)
|
120 |
+
|
121 |
+
if self.pad > 0:
|
122 |
+
x = F.pad(x, (0, self.pad), "constant", 0)
|
123 |
+
|
124 |
+
x = self.conv(x)
|
125 |
+
|
126 |
+
if self.span_mode == "conv_sum":
|
127 |
+
x = x * (self.pad + 1)
|
128 |
+
|
129 |
+
return torch.einsum('bdl->bld', x)
|
130 |
+
|
131 |
+
|
132 |
+
class SpanConv(nn.Module):
|
133 |
+
def __init__(self, hidden_size, max_width, span_mode):
|
134 |
+
super().__init__()
|
135 |
+
|
136 |
+
kernels = [i + 2 for i in range(max_width - 1)]
|
137 |
+
|
138 |
+
self.convs = nn.ModuleList()
|
139 |
+
|
140 |
+
for kernel in kernels:
|
141 |
+
self.convs.append(SpanConvBlock(hidden_size, kernel, span_mode))
|
142 |
+
|
143 |
+
self.project = nn.Sequential(
|
144 |
+
nn.ReLU(),
|
145 |
+
nn.Linear(hidden_size, hidden_size)
|
146 |
+
)
|
147 |
+
|
148 |
+
def forward(self, x, *args):
|
149 |
+
|
150 |
+
span_reps = [x]
|
151 |
+
|
152 |
+
for conv in self.convs:
|
153 |
+
h = conv(x)
|
154 |
+
span_reps.append(h)
|
155 |
+
|
156 |
+
span_reps = torch.stack(span_reps, dim=-2)
|
157 |
+
|
158 |
+
return self.project(span_reps)
|
159 |
+
|
160 |
+
|
161 |
+
class SpanEndpointsBlock(nn.Module):
|
162 |
+
def __init__(self, kernel_size):
|
163 |
+
super().__init__()
|
164 |
+
|
165 |
+
self.kernel_size = kernel_size
|
166 |
+
|
167 |
+
def forward(self, x):
|
168 |
+
B, L, D = x.size()
|
169 |
+
|
170 |
+
span_idx = torch.LongTensor(
|
171 |
+
[[i, i + self.kernel_size - 1] for i in range(L)]).to(x.device)
|
172 |
+
|
173 |
+
x = F.pad(x, (0, 0, 0, self.kernel_size - 1), "constant", 0)
|
174 |
+
|
175 |
+
# endrep
|
176 |
+
start_end_rep = torch.index_select(x, dim=1, index=span_idx.view(-1))
|
177 |
+
|
178 |
+
start_end_rep = start_end_rep.view(B, L, 2, D)
|
179 |
+
|
180 |
+
return start_end_rep
|
181 |
+
|
182 |
+
|
183 |
+
class ConvShare(nn.Module):
|
184 |
+
def __init__(self, hidden_size, max_width):
|
185 |
+
super().__init__()
|
186 |
+
|
187 |
+
self.max_width = max_width
|
188 |
+
|
189 |
+
self.conv_weigth = nn.Parameter(
|
190 |
+
torch.randn(hidden_size, hidden_size, max_width))
|
191 |
+
|
192 |
+
nn.init.kaiming_uniform_(self.conv_weigth, nonlinearity='relu')
|
193 |
+
|
194 |
+
self.project = nn.Sequential(
|
195 |
+
nn.ReLU(),
|
196 |
+
nn.Linear(hidden_size, hidden_size)
|
197 |
+
)
|
198 |
+
|
199 |
+
def forward(self, x, *args):
|
200 |
+
span_reps = []
|
201 |
+
|
202 |
+
x = torch.einsum('bld->bdl', x)
|
203 |
+
|
204 |
+
for i in range(self.max_width):
|
205 |
+
pad = i
|
206 |
+
x_i = F.pad(x, (0, pad), "constant", 0)
|
207 |
+
conv_w = self.conv_weigth[:, :, :i + 1]
|
208 |
+
out_i = F.conv1d(x_i, conv_w)
|
209 |
+
span_reps.append(out_i.transpose(-1, -2))
|
210 |
+
|
211 |
+
out = torch.stack(span_reps, dim=-2)
|
212 |
+
|
213 |
+
return self.project(out)
|
214 |
+
|
215 |
+
|

def extract_elements(sequence, indices):
    B, L, D = sequence.shape
    K = indices.shape[1]

    # Expand indices to [B, K, D]
    expanded_indices = indices.unsqueeze(2).expand(-1, -1, D)

    # Gather the elements
    extracted_elements = torch.gather(sequence, 1, expanded_indices)

    return extracted_elements

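`extract_elements` is a batched index lookup: for each batch element it gathers the rows of `sequence` named by `indices`. A quick demonstration with the function as defined above:

```python
import torch

sequence = torch.arange(12.).view(1, 4, 3)   # [B=1, L=4, D=3]
indices = torch.tensor([[0, 2]])             # pick rows 0 and 2

print(extract_elements(sequence, indices))
# tensor([[[0., 1., 2.],
#          [6., 7., 8.]]])
```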

class SpanMarker(nn.Module):

    def __init__(self, hidden_size, max_width, dropout=0.4):
        super().__init__()

        self.max_width = max_width

        self.project_start = nn.Sequential(
            nn.Linear(hidden_size, hidden_size * 2, bias=True),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_size * 2, hidden_size, bias=True),
        )

        self.project_end = nn.Sequential(
            nn.Linear(hidden_size, hidden_size * 2, bias=True),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_size * 2, hidden_size, bias=True),
        )

        self.out_project = nn.Linear(hidden_size * 2, hidden_size, bias=True)

    def forward(self, h, span_idx):
        # h of shape [B, L, D]
        # span_idx of shape [B, L * max_width, 2]

        B, L, D = h.size()

        # project start and end
        start_rep = self.project_start(h)
        end_rep = self.project_end(h)

        start_span_rep = extract_elements(start_rep, span_idx[:, :, 0])
        end_span_rep = extract_elements(end_rep, span_idx[:, :, 1])

        # concat start and end
        cat = torch.cat([start_span_rep, end_span_rep], dim=-1).relu()

        # project
        cat = self.out_project(cat)

        # reshape
        return cat.view(B, L, self.max_width, D)


class SpanMarkerV0(nn.Module):
    """
    Marks and projects span endpoints using an MLP.
    """

    def __init__(self, hidden_size: int, max_width: int, dropout: float = 0.4):
        super().__init__()
        self.max_width = max_width
        self.project_start = create_projection_layer(hidden_size, dropout)
        self.project_end = create_projection_layer(hidden_size, dropout)

        self.out_project = create_projection_layer(hidden_size * 2, dropout, hidden_size)

    def forward(self, h: torch.Tensor, span_idx: torch.Tensor) -> torch.Tensor:
        B, L, D = h.size()

        start_rep = self.project_start(h)
        end_rep = self.project_end(h)

        start_span_rep = extract_elements(start_rep, span_idx[:, :, 0])
        end_span_rep = extract_elements(end_rep, span_idx[:, :, 1])

        cat = torch.cat([start_span_rep, end_span_rep], dim=-1).relu()

        return self.out_project(cat).view(B, L, self.max_width, D)

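Both marker variants expect a `span_idx` tensor that enumerates a (start, end) pair for every position and width. How it is built below is an illustrative assumption (GLiNER's own preprocessing constructs it elsewhere), as is the import path; a sketch:

```python
import torch
from modules.span_rep import SpanMarkerV0  # assumed import path

B, L, D, max_width = 2, 6, 32, 3

# (start, end) indices for each start position and width, ends clamped to L - 1
span_idx = torch.tensor(
    [[i, min(i + w, L - 1)] for i in range(L) for w in range(max_width)]
).unsqueeze(0).expand(B, -1, -1)             # [B, L * max_width, 2]

layer = SpanMarkerV0(hidden_size=D, max_width=max_width)
span_reps = layer(torch.randn(B, L, D), span_idx)
print(span_reps.shape)  # torch.Size([2, 6, 3, 32])
```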

class ConvShareV2(nn.Module):
    def __init__(self, hidden_size, max_width):
        super().__init__()

        self.max_width = max_width

        self.conv_weight = nn.Parameter(
            torch.randn(hidden_size, hidden_size, max_width)
        )

        nn.init.xavier_normal_(self.conv_weight)

    def forward(self, x, *args):
        span_reps = []

        x = torch.einsum('bld->bdl', x)

        for i in range(self.max_width):
            pad = i
            x_i = F.pad(x, (0, pad), "constant", 0)
            conv_w = self.conv_weight[:, :, :i + 1]
            out_i = F.conv1d(x_i, conv_w)
            span_reps.append(out_i.transpose(-1, -2))

        out = torch.stack(span_reps, dim=-2)

        return out


class SpanRepLayer(nn.Module):
    """
    Various span representation approaches.
    """

    def __init__(self, hidden_size, max_width, span_mode, **kwargs):
        super().__init__()

        if span_mode == 'marker':
            self.span_rep_layer = SpanMarker(hidden_size, max_width, **kwargs)
        elif span_mode == 'markerV0':
            self.span_rep_layer = SpanMarkerV0(hidden_size, max_width, **kwargs)
        elif span_mode == 'query':
            self.span_rep_layer = SpanQuery(
                hidden_size, max_width, trainable=True)
        elif span_mode == 'mlp':
            self.span_rep_layer = SpanMLP(hidden_size, max_width)
        elif span_mode == 'cat':
            self.span_rep_layer = SpanCAT(hidden_size, max_width)
        elif span_mode == 'conv_conv':
            self.span_rep_layer = SpanConv(
                hidden_size, max_width, span_mode='conv_conv')
        elif span_mode == 'conv_max':
            self.span_rep_layer = SpanConv(
                hidden_size, max_width, span_mode='conv_max')
        elif span_mode == 'conv_mean':
            self.span_rep_layer = SpanConv(
                hidden_size, max_width, span_mode='conv_mean')
        elif span_mode == 'conv_sum':
            self.span_rep_layer = SpanConv(
                hidden_size, max_width, span_mode='conv_sum')
        elif span_mode == 'conv_share':
            self.span_rep_layer = ConvShare(hidden_size, max_width)
        else:
            raise ValueError(f'Unknown span mode {span_mode}')

    def forward(self, x, *args):
        return self.span_rep_layer(x, *args)
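`SpanRepLayer` is the single switch the rest of the model goes through; conv-based modes need no `span_idx`. A minimal sketch (sizes and import path are illustrative):

```python
import torch
from modules.span_rep import SpanRepLayer  # assumed import path

layer = SpanRepLayer(hidden_size=64, max_width=4, span_mode='conv_sum')
out = layer(torch.randn(2, 10, 64))
print(out.shape)  # torch.Size([2, 10, 4, 64])
```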
GLiNER/modules/token_rep.py
ADDED
@@ -0,0 +1,54 @@
from typing import List

import torch
from flair.data import Sentence
from flair.embeddings import TransformerWordEmbeddings
from torch import nn
from torch.nn.utils.rnn import pad_sequence

# flair.cache_root = '/gpfswork/rech/pds/upa43yu/.cache'


class TokenRepLayer(nn.Module):
    def __init__(self, model_name: str = "bert-base-cased", fine_tune: bool = True, subtoken_pooling: str = "first",
                 hidden_size: int = 768,
                 add_tokens=["[SEP]", "[ENT]"]
                 ):
        super().__init__()

        self.bert_layer = TransformerWordEmbeddings(
            model_name,
            fine_tune=fine_tune,
            subtoken_pooling=subtoken_pooling,
            allow_long_sentences=True
        )

        # add tokens to vocabulary
        self.bert_layer.tokenizer.add_tokens(add_tokens)

        # resize token embeddings
        self.bert_layer.model.resize_token_embeddings(len(self.bert_layer.tokenizer))

        bert_hidden_size = self.bert_layer.embedding_length

        if hidden_size != bert_hidden_size:
            self.projection = nn.Linear(bert_hidden_size, hidden_size)

    def forward(self, tokens: List[List[str]], lengths: torch.Tensor):
        token_embeddings = self.compute_word_embedding(tokens)

        if hasattr(self, "projection"):
            token_embeddings = self.projection(token_embeddings)

        B = len(lengths)
        max_length = lengths.max()
        mask = (torch.arange(max_length).view(1, -1).repeat(B, 1) < lengths.cpu().unsqueeze(1)).to(
            token_embeddings.device).long()
        return {"embeddings": token_embeddings, "mask": mask}

    def compute_word_embedding(self, tokens):
        sentences = [Sentence(i) for i in tokens]
        self.bert_layer.embed(sentences)
        token_embeddings = pad_sequence([torch.stack([t.embedding for t in k]) for k in sentences], batch_first=True)
        return token_embeddings
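A minimal driving sketch for `TokenRepLayer` (the backbone is downloaded on first use; the sentences and import path are illustrative):

```python
import torch
from modules.token_rep import TokenRepLayer  # assumed import path

layer = TokenRepLayer(model_name="bert-base-cased", hidden_size=768)

tokens = [["Rome", "is", "a", "city"], ["Hello", "world"]]
lengths = torch.tensor([4, 2])

out = layer(tokens, lengths)
print(out["embeddings"].shape)  # torch.Size([2, 4, 768]), padded to the longest sentence
print(out["mask"])              # 1 for real tokens, 0 for padding
```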
GLiNER/requirements.txt
ADDED
@@ -0,0 +1,6 @@
torch
transformers
huggingface_hub
flair
seqeval
tqdm
GLiNER/save_load.py
ADDED
@@ -0,0 +1,20 @@
import torch
from model import GLiNER


def save_model(current_model, path):
    config = current_model.config
    dict_save = {"model_weights": current_model.state_dict(), "config": config}
    torch.save(dict_save, path)


def load_model(path, model_name=None, device=None):
    dict_load = torch.load(path, map_location=torch.device('cpu'))
    config = dict_load["config"]

    if model_name is not None:
        config.model_name = model_name

    loaded_model = GLiNER(config)
    loaded_model.load_state_dict(dict_load["model_weights"])
    return loaded_model.to(device) if device is not None else loaded_model
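A round-trip sketch with the two helpers (the path is a placeholder, and `model` is assumed to be a trained GLiNER instance):

```python
import torch
from save_load import save_model, load_model

save_model(model, "logs/model_1000")

device = "cuda" if torch.cuda.is_available() else "cpu"
restored = load_model("logs/model_1000", device=device)
restored.eval()
```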
GLiNER/train.py
ADDED
@@ -0,0 +1,131 @@
import argparse
import json
import os

import torch
import yaml
from tqdm import tqdm
from transformers import get_cosine_schedule_with_warmup

# from model_nested import NerFilteredSemiCRF
from model import GLiNER
from modules.run_evaluation import get_for_all_path, sample_train_data
from save_load import save_model, load_model


# train function
def train(model, optimizer, train_data, num_steps=1000, eval_every=100, log_dir="logs", warmup_ratio=0.1,
          train_batch_size=8, device='cuda'):
    model.train()

    # initialize data loaders
    train_loader = model.create_dataloader(train_data, batch_size=train_batch_size, shuffle=True)

    pbar = tqdm(range(num_steps))

    # warmup_ratio < 1 is a fraction of num_steps; values >= 1 are an absolute step count
    if warmup_ratio < 1:
        num_warmup_steps = int(num_steps * warmup_ratio)
    else:
        num_warmup_steps = int(warmup_ratio)

    scheduler = get_cosine_schedule_with_warmup(
        optimizer,
        num_warmup_steps=num_warmup_steps,
        num_training_steps=num_steps
    )

    iter_train_loader = iter(train_loader)

    for step in pbar:
        try:
            x = next(iter_train_loader)
        except StopIteration:
            # restart the loader once an epoch is exhausted
            iter_train_loader = iter(train_loader)
            x = next(iter_train_loader)

        # move the batch tensors to the training device
        for k, v in x.items():
            if isinstance(v, torch.Tensor):
                x[k] = v.to(device)

        try:
            loss = model(x)  # Forward pass
        except Exception:
            # skip batches that fail in the forward pass
            continue

        # check if loss is nan
        if torch.isnan(loss):
            continue

        loss.backward()  # Compute gradients
        optimizer.step()  # Update parameters
        scheduler.step()  # Update learning rate schedule
        optimizer.zero_grad()  # Reset gradients

        description = f"step: {step} | epoch: {step // len(train_loader)} | loss: {loss.item():.2f}"

        if (step + 1) % eval_every == 0:
            current_path = os.path.join(log_dir, f'model_{step + 1}')
            save_model(model, current_path)
            # val_data_dir = "/gpfswork/rech/ohy/upa43yu/NER_datasets"  # can be obtained from "https://drive.google.com/file/d/1T-5IbocGka35I7X3CE6yKe5N_Xg2lVKT/view"
            # get_for_all_path(model, step, log_dir, val_data_dir)  # uncomment to evaluate during training

            model.train()  # restore training mode after checkpointing/evaluation

        pbar.set_description(description)


def create_parser():
    parser = argparse.ArgumentParser(description="Span-based NER")
    parser.add_argument("--config", type=str, default="config.yaml", help="Path to config file")
    parser.add_argument('--log_dir', type=str, default='logs', help='Path to the log directory')
    return parser


def load_config_as_namespace(config_file):
    with open(config_file, 'r') as f:
        config_dict = yaml.safe_load(f)
    return argparse.Namespace(**config_dict)


if __name__ == "__main__":
    # parse args
    parser = create_parser()
    args = parser.parse_args()

    # load config
    config = load_config_as_namespace(args.config)

    config.log_dir = args.log_dir

    # config.train_data is either a ready-made JSON file or a directory to sample from
    try:
        with open(config.train_data, 'r') as f:
            data = json.load(f)
    except (OSError, json.JSONDecodeError):
        data = sample_train_data(config.train_data, 10000)

    if config.prev_path != "none":
        model = load_model(config.prev_path)
        model.config = config
    else:
        model = GLiNER(config)

    if torch.cuda.is_available():
        model = model.cuda()

    lr_encoder = float(config.lr_encoder)
    lr_others = float(config.lr_others)

    optimizer = torch.optim.AdamW([
        # encoder
        {'params': model.token_rep_layer.parameters(), 'lr': lr_encoder},
        {'params': model.rnn.parameters(), 'lr': lr_others},
        # projection layers
        {'params': model.span_rep_layer.parameters(), 'lr': lr_others},
        {'params': model.prompt_rep_layer.parameters(), 'lr': lr_others},
    ])

    device = 'cuda' if torch.cuda.is_available() else 'cpu'

    train(model, optimizer, data, num_steps=config.num_steps, eval_every=config.eval_every,
          log_dir=config.log_dir, warmup_ratio=config.warmup_ratio, train_batch_size=config.train_batch_size,
          device=device)
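`train.py` reads only a handful of keys from the YAML file; the rest of the namespace is passed on to `GLiNER(config)`. A sketch of a `config.yaml` covering exactly the keys the script itself touches (values are illustrative, and the repository's real config also carries the model hyperparameters the GLiNER constructor expects):

```yaml
train_data: "train.json"   # a JSON dataset, or a directory for sample_train_data
prev_path: "none"          # "none" trains from scratch; otherwise a checkpoint path
lr_encoder: 1e-5           # learning rate of the transformer backbone
lr_others: 5e-5            # learning rate of the remaining layers
num_steps: 30000
eval_every: 5000
warmup_ratio: 0.1          # < 1: fraction of num_steps; >= 1: absolute warmup steps
train_batch_size: 8
```

The script is then launched with `python train.py --config config.yaml --log_dir logs`.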