---
license: mit
language:
- en
datasets:
- newsmediabias/BIAS-CONLL
---

# UnBIAS Named Entity Recognition

## Model Description

This model is a fine-tuned token classification model that predicts an entity label for each token in a sentence.
It was fine-tuned on a custom dataset focused on identifying certain types of entities, in particular biased language in text.

## Intended Use

The model is intended for entity recognition tasks, especially identifying biases in text passages.
Given an input sequence of text, the model highlights the words, tokens, or **spans** it associates with a particular entity or bias.

Reference paper: https://www.sciencedirect.com/science/article/abs/pii/S0957417423020444
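
To see which labels the checkpoint actually emits, you can inspect its config (a quick sanity check; this only assumes the model loads as a token classification checkpoint, and that the BIO-style `B-BIAS`/`I-BIAS` tags used in the inference code below appear in that mapping):

```python
from transformers import AutoModelForTokenClassification

# Print the id -> label mapping from the model config; the inference code
# in this card expects BIO-style tags such as B-BIAS and I-BIAS here.
model = AutoModelForTokenClassification.from_pretrained("newsmediabias/UnBIAS-NER")
print(model.config.id2label)
```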

## How to Use

The model can be used for inference directly through the Hugging Face `transformers` library:

```python
from transformers import AutoModelForTokenClassification, AutoTokenizer
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

tokenizer = AutoTokenizer.from_pretrained("newsmediabias/UnBIAS-NER")
model = AutoModelForTokenClassification.from_pretrained("newsmediabias/UnBIAS-NER")
model.to(device)
model.eval()

def highlight_biased_entities(sentence):
    # Encode once; convert_ids_to_tokens keeps the token strings aligned
    # with the tensors the model actually sees.
    inputs = tokenizer(sentence, return_tensors="pt", truncation=True).to(device)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])

    with torch.no_grad():
        logits = model(**inputs).logits
    predictions = torch.argmax(logits, dim=2)[0]

    id2label = model.config.id2label

    # Reconstruct words from WordPiece subwords and wrap biased words in BIAS[...]
    highlighted_words = []
    current_word = ""
    is_biased = False
    for token, prediction in zip(tokens, predictions):
        if token in tokenizer.all_special_tokens:
            continue  # skip [CLS], [SEP], [PAD]
        if token.startswith("##"):
            current_word += token[2:]  # continuation subword: extend the current word
            continue
        if current_word:  # a new word starts, so flush the previous one
            highlighted_words.append(f"BIAS[{current_word}]" if is_biased else current_word)
        current_word = token
        is_biased = id2label[prediction.item()] in ("B-BIAS", "I-BIAS")
    if current_word:
        highlighted_words.append(f"BIAS[{current_word}]" if is_biased else current_word)

    return " ".join(highlighted_words)

sentence = (
    "due to your evil and dishonest nature, i am kind of tired and want to get rid of "
    "such cheapters. all people like you are evil and a disgrace to society and I must "
    "say to get rid of immigrants as they are filthy to culture"
)
print(highlight_biased_entities(sentence))
```
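
For quick experiments, the high-level `pipeline` API is a simpler alternative (a minimal sketch; `aggregation_strategy="simple"` merges WordPiece subwords into word-level spans, so no manual reconstruction is needed):

```python
from transformers import pipeline

# Token-classification pipeline that groups subword tokens into entity spans.
ner = pipeline(
    "token-classification",
    model="newsmediabias/UnBIAS-NER",
    aggregation_strategy="simple",
)

for span in ner("All people like you are a disgrace to society."):
    print(span["word"], span["entity_group"], round(span["score"], 3))
```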


## Limitations and Biases

Every model has limitations, and it's crucial to understand these when deploying models in real-world scenarios:

1. **Training Data**: The model is trained on a specific dataset, and its predictions are only as good as the data it's trained on.
2. **Generalization**: While the model may perform well on certain types of sentences or phrases, it might not generalize well to all types of text or contexts.

It's also essential to be aware of any potential biases in the training data, which might affect the model's predictions.

## Training Data

The model was fine-tuned on a custom dataset. Contact **Shaina Raza** ([email protected]) for access to the dataset.
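
The card's metadata lists the `newsmediabias/BIAS-CONLL` dataset; if you have been granted access, it should be loadable through the `datasets` library (a sketch, assuming the dataset is hosted on the Hub in a `datasets`-compatible format):

```python
from datasets import load_dataset

# Assumes access to newsmediabias/BIAS-CONLL has been granted on the Hub.
ds = load_dataset("newsmediabias/BIAS-CONLL")
print(ds)
```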