LizaKovtun committed eb9285d (1 parent: 1129006)
Update README.md
README.md CHANGED
@@ -5,17 +5,46 @@ tags:
|
|
5 |
- finance
|
6 |
language:
|
7 |
- en
|
8 |
-
|
9 |
---
|
10 |
-
|
11 |
-
|
12 |
|
13 |
-
## Usage
|
14 |
```python
|
15 |
from collections import OrderedDict
|
16 |
from transformers import MPNetPreTrainedModel, MPNetModel, AutoTokenizer
|
17 |
import torch
|
18 |
-
|
|
|
19 |
def mean_pooling(model_output, attention_mask):
|
20 |
token_embeddings = model_output #First element of model_output contains all token embeddings
|
21 |
input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
|
@@ -42,8 +71,6 @@ class ESGify(MPNetPreTrainedModel):
|
|
42 |
|
43 |
|
44 |
def forward(self, input_ids, attention_mask):
|
45 |
-
|
46 |
-
|
47 |
# Feed input to mpnet model
|
48 |
outputs = self.mpnet(input_ids=input_ids,
|
49 |
attention_mask=attention_mask)
|
@@ -54,65 +81,21 @@ class ESGify(MPNetPreTrainedModel):
|
|
54 |
# apply sigmoid
|
55 |
logits = 1.0 / (1.0 + torch.exp(-logits))
|
56 |
return logits
|
|
|
|
|
|
|
57 |
|
|
|
58 |
model = ESGify.from_pretrained('ai-lab/ESGify')
|
59 |
tokenizer = AutoTokenizer.from_pretrained('ai-lab/ESGify')
|
60 |
-
|
61 |
-
to_model = tokenizer.batch_encode_plus(
|
62 |
-
texts,
|
63 |
-
add_special_tokens=True,
|
64 |
-
max_length=512,
|
65 |
-
return_token_type_ids=False,
|
66 |
-
padding="max_length",
|
67 |
-
truncation=True,
|
68 |
-
return_attention_mask=True,
|
69 |
-
return_tensors='pt',
|
70 |
-
)
|
71 |
-
results = model(**to_model)
|
72 |
-
|
73 |
|
74 |
-
|
75 |
-
|
76 |
-
from flair.data import Sentence
|
77 |
-
from flair.nn import Classifier
|
78 |
-
from torch.utils.data import DataLoader
|
79 |
-
from nltk.corpus import stopwords
|
80 |
-
from nltk.tokenize import word_tokenize
|
81 |
-
|
82 |
-
stop_words = set(stopwords.words('english'))
|
83 |
-
tagger = Classifier.load('ner-ontonotes-large')
|
84 |
-
tag_list = ['FAC','LOC','ORG','PERSON']
|
85 |
-
texts_with_masks = []
|
86 |
-
for example_sent in texts:
|
87 |
-
filtered_sentence = []
|
88 |
-
word_tokens = word_tokenize(example_sent)
|
89 |
-
# converts the words in word_tokens to lower case and then checks whether
|
90 |
-
#they are present in stop_words or not
|
91 |
-
for w in word_tokens:
|
92 |
-
if w.lower() not in stop_words:
|
93 |
-
filtered_sentence.append(w)
|
94 |
-
# make a sentence
|
95 |
-
sentence = Sentence(' '.join(filtered_sentence))
|
96 |
-
# run NER over sentence
|
97 |
-
tagger.predict(sentence)
|
98 |
-
sent = ' '.join(filtered_sentence)
|
99 |
-
k = 0
|
100 |
-
new_string = ''
|
101 |
-
start_t = 0
|
102 |
-
for i in sentence.get_labels():
|
103 |
-
info = i.to_dict()
|
104 |
-
val = info['value']
|
105 |
-
if info['confidence']>0.8 and val in tag_list :
|
106 |
-
|
107 |
-
if i.data_point.start_position>start_t :
|
108 |
-
new_string+=sent[start_t:i.data_point.start_position]
|
109 |
-
start_t = i.data_point.end_position
|
110 |
-
new_string+= f'<{val}>'
|
111 |
-
new_string+=sent[start_t:-1]
|
112 |
-
texts_with_masks.append(new_string)
|
113 |
|
|
|
|
|
114 |
to_model = tokenizer.batch_encode_plus(
|
115 |
-
|
116 |
add_special_tokens=True,
|
117 |
max_length=512,
|
118 |
return_token_type_ids=False,
|
@@ -124,21 +107,27 @@ to_model = tokenizer.batch_encode_plus(
|
|
124 |
results = model(**to_model)
|
125 |
```
|
126 |
|
127 |
-
|
128 |
-
|
129 |
-
## Background
|
130 |
-
|
131 |
-
The project aims to develop the ESG Risks classification model with a custom ESG risks definition methodology.
|
132 |
|
133 |
|
134 |
-
|
135 |
|
136 |
-
|
|
|
137 |
|
138 |
-
We use the pretrained [`microsoft/mpnet-base`](https://huggingface.co/microsoft/mpnet-base) model.
|
139 |
-
Next, we do the domain-adaptation procedure by Mask Language Modeling pertaining with using texts of ESG reports.
|
140 |
|
|
|
141 |
|
142 |
-
|
|
|
|
|
143 |
|
144 |
-
We use the ESG news dataset of 2000 texts with manually annotation of ESG specialists.
|
|
|
5 |
- finance
|
6 |
language:
|
7 |
- en
|
8 |
+
|
9 |
---
|
10 |
+
# About ESGify
|
11 |
+
**ESGify** is a model for multilabel classification of news with respect to ESG risks. Our custom methodology includes 46 ESG classes and 1 class for texts not relevant to ESG, giving 47 classes in total:
|
12 |
+
|
13 |
+
| E | S | G |
|
14 |
+
| ----------- | ----------- | ----------- |
|
15 |
+
| **Biodiversity** | **Communities Health and Safety** | **Legal Proceedings & Law Violations** |
|
16 |
+
| **Emergencies (Environmental)** | **Land Acquisition and Resettlement (S)** | **Corporate Governance** |
|
17 |
+
| **Hazardous Materials Management** | **Emergencies (Social)** | **Responsible Investment & Greenwashing** |
|
18 |
+
| **Environmental Management** | **Human Rights** | **Economic Crime** |
|
19 |
+
| **Landscape Transformation** | **Labor Relations Management** | **Disclosure** |
|
20 |
+
| **Climate Risks** | **Freedom of Association and Right to Organise** | **Values and Ethics** |
|
21 |
+
| **Surface Water Pollution** | **Employee Health and Safety** | **Risk Management and Internal Control** |
|
22 |
+
| **Animal Welfare** | **Product Safety and Quality** | **Strategy Implementation** |
|
23 |
+
| **Water Consumption** | **Indigenous People** | **Supply Chain (Economic / Governance)** |
|
24 |
+
| **Greenhouse Gas Emissions** | **Cultural Heritage** ||
|
25 |
+
| **Air Pollution** | **Forced Labour** ||
|
26 |
+
| **Waste Management** | **Supply Chain (Social)** ||
|
27 |
+
| **Soil and Groundwater Impact** | **Discrimination** ||
|
28 |
+
| **Wastewater Management** | **Minimum Age and Child Labour** ||
|
29 |
+
| **Natural Resources** | **Data Safety** ||
|
30 |
+
| **Physical Impacts** | **Retrenchment** ||
|
31 |
+
| **Supply Chain (Environmental)** |||
|
32 |
+
| **Planning Limitations** |||
|
33 |
+
| **Energy Efficiency and Renewables** |||
|
34 |
+
| **Land Acquisition and Resettlement (E)** |||
|
35 |
+
| **Land Rehabilitation** |||
|
36 |
+
|
37 |
+
|
38 |
+
# Usage
|
39 |
+
|
40 |
+
ESGify is based on the MPNet architecture with a custom classification head. The ESGify class is defined as follows.
|
41 |
|
|
|
42 |
```python
|
43 |
from collections import OrderedDict
|
44 |
from transformers import MPNetPreTrainedModel, MPNetModel, AutoTokenizer
|
45 |
import torch
|
46 |
+
|
47 |
+
# Mean Pooling - Take attention mask into account for correct averaging
|
48 |
def mean_pooling(model_output, attention_mask):
|
49 |
token_embeddings = model_output #First element of model_output contains all token embeddings
|
50 |
input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
|
|
|
71 |
|
72 |
|
73 |
def forward(self, input_ids, attention_mask):
|
|
|
|
|
74 |
# Feed input to mpnet model
|
75 |
outputs = self.mpnet(input_ids=input_ids,
|
76 |
attention_mask=attention_mask)
|
|
|
81 |
# apply sigmoid
|
82 |
logits = 1.0 / (1.0 + torch.exp(-logits))
|
83 |
return logits
|
84 |
+
```
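To make the pooling step concrete, here is a minimal plain-Python sketch of attention-masked mean pooling. The body of `mean_pooling` is only partly visible in this diff, so the sketch assumes the usual masked average: sum the embeddings of real tokens and divide by their count. The name `masked_mean` and the toy vectors are illustrative, not part of the ESGify code.

```python
def masked_mean(token_embeddings, attention_mask):
    """Average token vectors, ignoring positions where the mask is 0.

    token_embeddings: list of equal-length float lists (one per token).
    attention_mask:   list of 0/1 ints, same length as token_embeddings.
    """
    dim = len(token_embeddings[0])
    summed = [0.0] * dim
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:
            for j, value in enumerate(vector):
                summed[j] += value
    # Guard against an all-padding sequence to avoid division by zero.
    count = max(sum(attention_mask), 1e-9)
    return [s / count for s in summed]


# Two real tokens followed by one padding token (mask 0).
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]
print(masked_mean(tokens, mask))  # [2.0, 3.0]
```

The padding vector does not affect the result, which is why the attention mask has to be passed alongside the embeddings.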
|
85 |
+
|
86 |
+
After defining the model class, we initialize ESGify and the tokenizer with the pre-trained weights:
|
87 |
|
88 |
+
```python
|
89 |
model = ESGify.from_pretrained('ai-lab/ESGify')
|
90 |
tokenizer = AutoTokenizer.from_pretrained('ai-lab/ESGify')
|
91 |
+
```
|
92 |
|
93 |
+
Getting results from the model:
|
94 |
|
95 |
+
```python
|
96 |
+
texts = ['text1','text2']
|
97 |
to_model = tokenizer.batch_encode_plus(
|
98 |
+
texts,
|
99 |
add_special_tokens=True,
|
100 |
max_length=512,
|
101 |
return_token_type_ids=False,
|
|
|
107 |
results = model(**to_model)
|
108 |
```
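The `max_length=512` setting, together with the truncation and padding options of `batch_encode_plus`, gives every input the same length. As a rough illustration of what that does to the token ids and the attention mask, consider this plain-Python sketch. The helper `pad_or_truncate`, the pad id of 1, and the toy ids are assumptions for illustration; the real tokenizer also inserts special tokens and accounts for them when truncating.

```python
def pad_or_truncate(token_ids, max_length, pad_id=1):
    """Return (input_ids, attention_mask), each exactly max_length long."""
    kept = token_ids[:max_length]       # truncation: drop ids past max_length
    n_pad = max_length - len(kept)      # padding needed (0 if truncated)
    input_ids = kept + [pad_id] * n_pad
    attention_mask = [1] * len(kept) + [0] * n_pad
    return input_ids, attention_mask


short_ids, short_mask = pad_or_truncate([11, 12, 13], max_length=5)
long_ids, long_mask = pad_or_truncate([11, 12, 13, 14, 15, 16, 17], max_length=5)
print(short_ids, short_mask)  # [11, 12, 13, 1, 1] [1, 1, 1, 0, 0]
print(long_ids, long_mask)    # [11, 12, 13, 14, 15] [1, 1, 1, 1, 1]
```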
|
109 |
|
110 |
+
To identify top-3 classes by relevance and their scores:
|
111 |
|
112 |
+
```python
|
113 |
+
for i in torch.topk(results, k=3).indices.tolist()[0]:
|
114 |
+
    print(f"{model.id2label[i]}: {round(results.flatten()[i].item(), 3)}")
|
115 |
+
```
|
116 |
|
117 |
+
For example, for the news "She faced employment rejection because of her gender", we get the following top-3 labels:
|
118 |
+
```
|
119 |
+
Discrimination: 0.944
|
120 |
+
Strategy Implementation: 0.82
|
121 |
+
Indigenous People: 0.499
|
122 |
+
```
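Because the model applies a sigmoid to each class independently, the scores do not sum to 1 and several labels can be active at once. As an alternative to a fixed top-3, one could keep every label above a cutoff. The sketch below reuses the example scores shown above; the 0.5 threshold and the helper name `labels_above` are assumptions, not values recommended by the authors.

```python
def labels_above(scores, threshold=0.5):
    """Return (label, score) pairs with score >= threshold, highest first."""
    selected = [(label, s) for label, s in scores.items() if s >= threshold]
    return sorted(selected, key=lambda pair: pair[1], reverse=True)


example_scores = {
    "Discrimination": 0.944,
    "Strategy Implementation": 0.82,
    "Indigenous People": 0.499,
}
for label, score in labels_above(example_scores):
    print(f"{label}: {score}")
```

With this cutoff, "Indigenous People" (0.499) is dropped while the other two labels are kept. A suitable threshold would need to be tuned on validation data.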
|
123 |
|
124 |
+
Before training our model, we masked words related to Organisation, Date, Country, and Person to prevent false associations between these entities and risks. Hence, we recommend processing text with the FLAIR NER model before inference.
|
125 |
+
An example of such preprocessing is given in https://colab.research.google.com/drive/15YcTW9KPSWesZ6_L4BUayqW_omzars0l?usp=sharing.
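As a rough sketch of the masking idea, the function below replaces entity spans with `<TAG>` placeholders, given character offsets such as a NER tagger would return. The span tuples, the 0.8 confidence cutoff, and the tag set are illustrative assumptions. It does not call FLAIR itself, and the linked notebook remains the reference for the actual preprocessing.

```python
def mask_entities(text, spans, tags=("ORG", "PERSON", "LOC", "FAC"), min_conf=0.8):
    """Replace confident entity spans with <TAG> placeholders.

    spans: iterable of (start, end, tag, confidence), non-overlapping,
           with character offsets into `text`.
    """
    pieces = []
    cursor = 0
    for start, end, tag, conf in sorted(spans):
        if tag not in tags or conf <= min_conf:
            continue
        pieces.append(text[cursor:start])  # text before the entity
        pieces.append(f"<{tag}>")          # placeholder instead of the entity
        cursor = end
    pieces.append(text[cursor:])           # remaining tail, kept in full
    return "".join(pieces)


sentence = "Acme Corp fined after spill near Lake Geneva"
# Hypothetical tagger output: (start, end, tag, confidence).
found = [(0, 9, "ORG", 0.99), (33, 44, "LOC", 0.95), (10, 15, "MISC", 0.90)]
print(mask_entities(sentence, found))  # <ORG> fined after spill near <LOC>
```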
|
126 |
|
|
|
|
|
127 |
|
128 |
+
# Training procedure
|
129 |
|
130 |
+
We use the pretrained [`microsoft/mpnet-base`](https://huggingface.co/microsoft/mpnet-base) model.
|
131 |
+
Next, we perform domain adaptation via masked language modeling (MLM) on texts from ESG reports.
|
132 |
+
Finally, we fine-tune the model on 2000 texts manually annotated by ESG specialists.
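For readers unfamiliar with the domain-adaptation step, the following is a simplified sketch of how masked language modeling prepares training examples: some tokens are hidden and the model is trained to recover them. The 15% masking rate, the `<mask>` string, and the helper `mask_tokens` are assumptions for illustration. The authors' actual masking settings are not given here, and real implementations typically also replace some selected tokens with random or unchanged tokens.

```python
import random


def mask_tokens(tokens, mask_token="<mask>", mask_prob=0.15, seed=0):
    """Randomly hide tokens; return (masked_tokens, targets).

    targets holds the original token at masked positions and None elsewhere,
    so a loss would only be computed where a token was hidden.
    """
    rng = random.Random(seed)  # seeded for reproducibility
    masked, targets = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(mask_token)
            targets.append(tok)
        else:
            masked.append(tok)
            targets.append(None)
    return masked, targets


words = "the company reduced greenhouse gas emissions across its supply chain".split()
masked_words, target_words = mask_tokens(words)
print(masked_words)
print(target_words)
```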
|
133 |
|
|