---
license: apache-2.0
datasets:
- darrow-ai/LegalLensNER
language:
- en
metrics:
- f1
pipeline_tag: token-classification
library_name: sklearn
tags:
- ner
- legal
- crf
---
# Model Card for Model ID

<!-- Provide a quick summary of what the model is/does. -->
A Conditional Random Field (CRF) model for named entity recognition using handcrafted features. It recognizes three entity types: Violation-on, Violation-by, and Law.
The dataset is in BIO format. The model achieves a macro-F1 score of 0.32 on unseen test data.
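
For readers unfamiliar with BIO tagging, the following minimal sketch shows how BIO-tagged tokens group into entity spans. The helper name `bio_to_spans`, the example sentence, and the tag name `LAW` are illustrative assumptions, not taken from the dataset or the model files.

```python
def bio_to_spans(tokens, tags):
    """Group BIO-tagged tokens into (entity_type, text) spans."""
    spans = []
    current_type, current_tokens = None, []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag always starts a new span, closing any open one
            if current_tokens:
                spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = tag[2:], [token]
        elif tag.startswith("I-") and current_type == tag[2:]:
            # An I- tag of the same type continues the open span
            current_tokens.append(token)
        else:
            # "O" or an inconsistent I- tag closes any open span
            if current_tokens:
                spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []
    if current_tokens:
        spans.append((current_type, " ".join(current_tokens)))
    return spans


# Hypothetical example sentence and tags
tokens = ["The", "company", "breached", "the", "Clean", "Air", "Act", "."]
tags = ["O", "O", "O", "O", "B-LAW", "I-LAW", "I-LAW", "O"]
print(bio_to_spans(tokens, tags))
```

In this scheme `B-` marks the first token of an entity, `I-` marks a continuation, and `O` marks tokens outside any entity.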

## Model Details

### Model Description

<!-- Provide a longer summary of what this model is. -->
The model was developed for the LegalLens 2024 competition, part of the Natural Legal Language Processing (NLLP) 2024 workshop. It uses handcrafted features to identify named entities in BIO-tagged text.

- **Developed by:** Shashank M Chakravarthy
- **Funded by [optional]:** NA
- **Shared by [optional]:** NA
- **Model type:** Statistical model (Conditional Random Field)
- **Language(s) (NLP):** English
- **License:** Apache 2.0
- **Finetuned from model [optional]:** NA

### Model Sources [optional]

<!-- Provide the basic links for the model. -->

- **Repository:** NA
- **Paper [optional]:** https://aclanthology.org/2024.nllp-1.33.pdf
- **Demo [optional]:** NA

## Uses

<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
The model detects named entities in unstructured text. It can be extended to other entity types by modifying the handcrafted features.

### Direct Use

<!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->

The model can be used directly on any unstructured text after light preprocessing (tokenization and POS tagging). The repository files include the evaluation script.

### Downstream Use [optional]

<!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->

### Out-of-Scope Use

<!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
The model is handcrafted for detecting violations and laws in text. It can also be applied to other legal text that contains similar entities.

## Bias, Risks, and Limitations

<!-- This section is meant to convey both technical and sociotechnical limitations. -->

The main limitation stems from the handcrafted features, which were designed for detecting violations and laws.

### Recommendations

<!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->

The model expects POS-tagged input. If the text used for prediction is not preprocessed with POS tags, the model will not perform as it is designed to.
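
One way to catch missing POS tags early is a small sanity check before feature extraction. This is only a sketch: the helper `check_tagged_sentence` is hypothetical and not part of the model files. It assumes each sentence is a list of `(token, pos_tag)` string pairs, which is the shape the getting-started code builds.

```python
def check_tagged_sentence(sent):
    """Return True if `sent` is a non-empty list of (token, pos_tag) string pairs."""
    if not isinstance(sent, list) or not sent:
        return False
    for item in sent:
        if not (isinstance(item, tuple) and len(item) == 2):
            return False
        token, tag = item
        if not (isinstance(token, str) and isinstance(tag, str) and token and tag):
            return False
    return True


# Hypothetical inputs: one properly tagged sentence, one with raw tokens only
good = [("Acme", "NNP"), ("violated", "VBD"), ("the", "DT"), ("statute", "NN")]
bad = ["Acme", "violated", "the", "statute"]
print(check_tagged_sentence(good), check_tagged_sentence(bad))
```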

## How to Get Started with the Model

Use the code below to get started with the model.

### Load libraries

```
import ast
import string

import joblib
import nltk
import pandas as pd
from nltk import pos_tag
from nltk.stem import PorterStemmer, WordNetLemmatizer
```

### Download the NLTK resources if they are not already present

```
# nltk.download skips resources that are already up to date
nltk.download('wordnet')
nltk.download('omw-1.4')
nltk.download("averaged_perceptron_tagger")
# Newer NLTK releases (3.9+) look up the tagger under this name instead
nltk.download("averaged_perceptron_tagger_eng")
```

### Class for grouping tokens as sentences (redundant if text is processed directly)

```
class getsentence(object):
    '''
    This class is used to get the sentences from the dataset.
    Converts from BIO format to sentences using their sentence numbers
    '''
    def __init__(self, data):
        self.n_sent = 1.0
        self.data = data
        self.empty = False
        # One list of (token, pos_tag) pairs per sentence number
        self.grouped = self.data.groupby("sentence_num").apply(self._agg_func)
        self.sentences = [s for s in self.grouped]

    def _agg_func(self, s):
        return [(w, p) for w, p in zip(s["token"].values.tolist(),
                                       s["pos_tag"].values.tolist())]
```

### Creates features for words in a sentence (code can be reduced using iteration)

```
def word2features(sent, i):
    '''
    This method is used to extract features from the words in the sentence.
    The main features extracted are:
    - word.lower(): The word in lowercase
    - word.isdigit(): If the word is a digit
    - word.punct(): If the word is a punctuation
    - postag: The pos tag of the word
    - word.lemma(): The lemma of the word
    - word.stem(): The stem of the word
    The features (not all) are also extracted for the 4 previous and 4 next words.
    '''
    wordnet_lemmatizer = WordNetLemmatizer()
    porter_stemmer = PorterStemmer()
    word = sent[i][0]
    postag = sent[i][1]

    features = {
        'bias': 1.0,
        'word.lower()': word.lower(),
        'word.isdigit()': word.isdigit(),
        # Check if it is punctuation
        'word.punct()': word in string.punctuation,
        'postag': postag,
        # Lemma of the word
        'word.lemma()': wordnet_lemmatizer.lemmatize(word),
        # Stem of the word
        'word.stem()': porter_stemmer.stem(word)
    }
    # Features of up to 4 previous words; BOS marks the first token
    if i > 0:
        word1 = sent[i-1][0]
        postag1 = sent[i-1][1]
        features.update({
            '-1:word.lower()': word1.lower(),
            '-1:word.isdigit()': word1.isdigit(),
            '-1:word.punct()': word1 in string.punctuation,
            '-1:postag': postag1
        })
        if i - 2 >= 0:
            features.update({
                '-2:word.lower()': sent[i-2][0].lower(),
                '-2:word.isdigit()': sent[i-2][0].isdigit(),
                '-2:word.punct()': sent[i-2][0] in string.punctuation,
                '-2:postag': sent[i-2][1]
            })
        if i - 3 >= 0:
            features.update({
                '-3:word.lower()': sent[i-3][0].lower(),
                '-3:word.isdigit()': sent[i-3][0].isdigit(),
                '-3:word.punct()': sent[i-3][0] in string.punctuation,
                '-3:postag': sent[i-3][1]
            })
        if i - 4 >= 0:
            features.update({
                '-4:word.lower()': sent[i-4][0].lower(),
                '-4:word.isdigit()': sent[i-4][0].isdigit(),
                '-4:word.punct()': sent[i-4][0] in string.punctuation,
                '-4:postag': sent[i-4][1]
            })
    else:
        features['BOS'] = True

    # Features of up to 4 next words; EOS marks the last token
    if i < len(sent)-1:
        word1 = sent[i+1][0]
        postag1 = sent[i+1][1]
        features.update({
            '+1:word.lower()': word1.lower(),
            '+1:word.isdigit()': word1.isdigit(),
            '+1:word.punct()': word1 in string.punctuation,
            '+1:postag': postag1
        })
        if i + 2 < len(sent):
            features.update({
                '+2:word.lower()': sent[i+2][0].lower(),
                '+2:word.isdigit()': sent[i+2][0].isdigit(),
                '+2:word.punct()': sent[i+2][0] in string.punctuation,
                '+2:postag': sent[i+2][1]
            })
        if i + 3 < len(sent):
            features.update({
                '+3:word.lower()': sent[i+3][0].lower(),
                '+3:word.isdigit()': sent[i+3][0].isdigit(),
                '+3:word.punct()': sent[i+3][0] in string.punctuation,
                '+3:postag': sent[i+3][1]
            })
        if i + 4 < len(sent):
            features.update({
                '+4:word.lower()': sent[i+4][0].lower(),
                '+4:word.isdigit()': sent[i+4][0].isdigit(),
                '+4:word.punct()': sent[i+4][0] in string.punctuation,
                '+4:postag': sent[i+4][1]
            })
    else:
        features['EOS'] = True

    return features
```
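
The heading above notes that this code can be reduced using iteration. As a rough sketch of what that could look like, the hypothetical helper below builds only the neighbour features (the `-4` to `+4` window plus `BOS`/`EOS`) in a loop. The function name `context_features` and the `window` parameter are assumptions for illustration. It produces the same keys as the explicit version, but it is not the code the released model was trained with.

```python
import string


def context_features(sent, i, window=4):
    """Build the neighbour features for token i of a [(token, pos_tag), ...] sentence."""
    features = {}
    for offset in range(-window, window + 1):
        j = i + offset
        # Skip the token itself and positions outside the sentence
        if offset == 0 or not 0 <= j < len(sent):
            continue
        word, postag = sent[j]
        prefix = f"{offset:+d}"  # signed offset: "-1", "+2", ...
        features.update({
            f"{prefix}:word.lower()": word.lower(),
            f"{prefix}:word.isdigit()": word.isdigit(),
            f"{prefix}:word.punct()": word in string.punctuation,
            f"{prefix}:postag": postag,
        })
    if i == 0:
        features["BOS"] = True  # first token of the sentence
    if i == len(sent) - 1:
        features["EOS"] = True  # last token of the sentence
    return features


# Hypothetical POS-tagged sentence
example = [("Acme", "NNP"), ("violated", "VBD"), ("Section", "NNP"), ("5", "CD"), (".", ".")]
print(sorted(context_features(example, 0)))
```

Merging the returned dictionary into the per-token features (`bias`, `word.lower()`, lemma, stem, and so on) would give the full feature set for a token.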

### Obtain features for a given sentence

```
def sent2features(sent):
    '''
    This method is used to extract features from the sentence.
    '''
    return [word2features(sent, i) for i in range(len(sent))]
```

### Load file from your directory

```
# Reading .xlsx files with pandas requires the openpyxl package
df_eval = pd.read_excel("testset_NER_LegalLens.xlsx")
```

### Parse the token lists and create POS tags for each token

```
# The tokens column is stored as a string; parse it into a Python list
df_eval["tokens"] = df_eval["tokens"].apply(ast.literal_eval)
# Tag each token list with NLTK and keep only the POS tags
df_eval['pos_tags'] = df_eval['tokens'].apply(lambda x: [tag[1]
                                              for tag in pos_tag(x)])
```

### Aggregate tokens to sentences

```
# Flatten the dataframe to one row per token, keeping the sentence number
data_eval = []
for i in range(len(df_eval)):
    for j in range(len(df_eval["tokens"][i])):
        data_eval.append(
            {
                "sentence_num": i+1,
                "id": df_eval["id"][i],
                "token": df_eval["tokens"][i][j],
                "pos_tag": df_eval["pos_tags"][i][j],
            }
        )
data_eval = pd.DataFrame(data_eval)
# Regroup the tokens into sentences and extract features for each one
getter = getsentence(data_eval)
sentences_eval = getter.sentences
X_eval = [sent2features(s) for s in sentences_eval]
```

### Load the model from your directory and predict

```
crf = joblib.load("../models/crf.pkl")
# One list of predicted BIO tags per sentence
y_pred_eval = crf.predict(X_eval)
print("NER tags predicted.")
df_eval["ner_tags"] = y_pred_eval
df_eval.drop(columns=["pos_tags"], inplace=True)
print("Saving the predictions...")
df_eval.to_csv("predictions_NERLens.csv", index=False)
print("Predictions saved.")
```

## Training Details

### Training Data

<!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->

https://huggingface.co/datasets/darrow-ai/LegalLensNER

### Training Procedure

<!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
The dataset's token column was first parsed into its proper data type, and POS tags were created for each token in the text. The model was then trained on a CPU using the handcrafted features. Training takes around 20-30 minutes for this dataset.

#### Preprocessing [optional]

POS tags were assigned to every token using the NLTK library.

#### Training Hyperparameters

- **Training regime:** NA <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->

#### Speeds, Sizes, Times [optional]

<!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
NA

## Evaluation

<!-- This section describes the evaluation protocols and provides the results. -->
The model was evaluated using the macro-F1 score. It obtained a score of 0.32 on unseen test data.

### Testing Data, Factors & Metrics

#### Testing Data

<!-- This should link to a Dataset Card if possible. -->

https://huggingface.co/datasets/darrow-ai/LegalLensNER

#### Factors

<!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->

[More Information Needed]

#### Metrics

<!-- These are the evaluation metrics being used, ideally with a description of why. -->

The macro-F1 score is used because it weights every entity class equally, so it is not inflated by the highly skewed entity distribution in the dataset.
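
To illustrate why macro averaging is less forgiving than accuracy on skewed labels, here is a minimal token-level sketch in plain Python. The labels and predictions are made up, and the function names are hypothetical. The competition's official scoring may differ (for example, it may score whole entities rather than tokens).

```python
def f1_per_label(y_true, y_pred, label):
    """F1 score of a single label, treating it as the positive class."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of the per-label F1 scores."""
    return sum(f1_per_label(y_true, y_pred, label) for label in labels) / len(labels)


# Made-up, heavily skewed example: 8 "O" tokens and 2 "LAW" tokens
y_true = ["O"] * 8 + ["LAW", "LAW"]
y_pred = ["O"] * 8 + ["LAW", "O"]
labels = ["O", "LAW"]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"accuracy={accuracy:.2f} macro_f1={macro_f1(y_true, y_pred, labels):.2f}")
```

Missing one of the two rare `LAW` tokens barely moves accuracy, but it pulls the macro-F1 down noticeably because the rare class counts as much as the frequent one.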

### Results

0.32 macro-F1 score on unseen data.

#### Summary

The model was designed and developed to tackle the NER task in unstructured text.

## Model Examination [optional]

<!-- Relevant interpretability work for the model goes here -->
NA

## Environmental Impact

<!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->

Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).

- **Hardware Type:** 13th Gen Intel(R) Core(TM) i7-1365U
- **Hours used:** 0.5 hours
- **Cloud Provider:** NA
- **Compute Region:** NA
- **Carbon Emitted:** Unknown

## Technical Specifications [optional]

### Model Architecture and Objective

[More Information Needed]

### Compute Infrastructure

[More Information Needed]

#### Hardware

[More Information Needed]

#### Software

[More Information Needed]

## Citation [optional]

<!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->

**BibTeX:**

[More Information Needed]

**APA:**

[More Information Needed]

## Glossary [optional]

<!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->

[More Information Needed]

## More Information [optional]

[More Information Needed]

## Model Card Authors [optional]

[More Information Needed]

## Model Card Contact

[More Information Needed]