---
license: apache-2.0
datasets:
- darrow-ai/LegalLensNER
language:
- en
metrics:
- f1
pipeline_tag: token-classification
library_name: sklearn
tags:
- ner
- legal
- crf
---
# Model Card for a CRF-Based Legal NER Model
<!-- Provide a quick summary of what the model is/does. -->
Conditional Random Field (CRF) model for named entity recognition with handcrafted features. Named entities recognized: Violation-on, Violation-by, and Law.
The dataset uses the BIO tagging format. The model achieves a macro-F1 score of 0.32.
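For illustration, a hypothetical BIO-tagged sentence (the tokens and tags below are made up, not taken from the dataset) looks like this:
```
# Hypothetical example of the BIO scheme: B-* marks the first token of an
# entity, I-* marks its continuation, and O marks tokens outside any entity.
tokens = ["The", "company", "violated", "the", "Clean", "Air", "Act", "."]
bio_tags = ["O", "O", "O", "O", "B-LAW", "I-LAW", "I-LAW", "O"]
```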
## Model Details
### Model Description
<!-- Provide a longer summary of what this model is. -->
The model was developed for the LegalLens 2024 shared task, held as part of the Natural Legal Language Processing (NLLP) 2024 workshop. It uses handcrafted features to identify named
entities in the BIO format.
- **Developed by:** Shashank M Chakravarthy
- **Funded by [optional]:** NA
- **Shared by [optional]:** NA
- **Model type:** Statistical Model
- **Language(s) (NLP):** English
- **License:** Apache 2.0 License
- **Finetuned from model [optional]:** NA
### Model Sources [optional]
<!-- Provide the basic links for the model. -->
- **Repository:** NA
- **Paper [optional]:** [ACL Anthology 2024.nllp-1.33](https://aclanthology.org/2024.nllp-1.33.pdf)
- **Demo [optional]:** NA
## Uses
<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
The model detects named entities in unstructured legal text. It can be extended to other entity types by further modifying the handcrafted features.
### Direct Use
<!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
The model can be used directly on any unstructured text after a small amount of preprocessing (tokenization and POS tagging). The files in this repository include the evaluation script.
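A minimal sketch of that preprocessing for a single raw sentence (an assumption-laden illustration: it relies on NLTK's tokenizer and POS tagger, on the `sent2features` function defined in the how-to section below, and on the pickled CRF model file; the example text and file path are placeholders):
```
import joblib
import nltk
from nltk import pos_tag
from nltk.tokenize import word_tokenize

nltk.download("punkt")                       # tokenizer models (assumed not yet downloaded)
nltk.download("averaged_perceptron_tagger")  # POS tagger models

text = "The company was accused of violating the Clean Air Act."  # placeholder input
tokens = word_tokenize(text)                 # split raw text into tokens
tagged = pos_tag(tokens)                     # list of (token, POS tag) pairs

# sent2features() is defined later in this card; crf.pkl is the trained model file.
crf = joblib.load("crf.pkl")
features = sent2features(tagged)
ner_tags = crf.predict([features])[0]        # one list of BIO tags per sentence
print(list(zip(tokens, ner_tags)))
```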
### Downstream Use [optional]
<!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
### Out-of-Scope Use
<!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
This model is handcrafted for detecting violations and laws in text. It can be applied to other legal text containing similar entities, but uses beyond that are out of scope.
## Bias, Risks, and Limitations
<!-- This section is meant to convey both technical and sociotechnical limitations. -->
The main limitation comes from the handcrafted features: the model can only capture patterns that those features encode, so it may not generalize to text that differs substantially from the training data.
### Recommendations
<!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
If the text used for prediction is not properly preprocessed with POS tags, the model will not perform as designed.
## How to Get Started with the Model
Use the code below to get started with the model.
### Load libraries
```
import ast
import pandas as pd
import joblib
import nltk
from nltk import pos_tag
import string
from nltk.stem import WordNetLemmatizer
from nltk.stem import PorterStemmer
```
### Download the required NLTK resources (if not already present)
```
nltk.download('wordnet')
nltk.download('omw-1.4')
nltk.download("averaged_perceptron_tagger")
```
### Class for grouping tokens into sentences (redundant if the text is processed directly)
```
class getsentence(object):
'''
This class is used to get the sentences from the dataset.
Converts from BIO format to sentences using their sentence numbers
'''
def __init__(self, data):
self.n_sent = 1.0
self.data = data
self.empty = False
self.grouped = self.data.groupby("sentence_num").apply(self._agg_func)
self.sentences = [s for s in self.grouped]
def _agg_func(self, s):
return [(w, p) for w, p in zip(s["token"].values.tolist(),
s["pos_tag"].values.tolist())]
```
### Create features for the words in a sentence (the code could be shortened by iterating over offsets)
```
def word2features(sent, i):
'''
This method is used to extract features from the words in the sentence.
The main features extracted are:
- word.lower(): The word in lowercase
- word.isdigit(): If the word is a digit
- word.punct(): If the word is a punctuation
- postag: The pos tag of the word
- word.lemma(): The lemma of the word
- word.stem(): The stem of the word
The features (not all) are also extracted for the 4 previous and 4 next words.
'''
wordnet_lemmatizer = WordNetLemmatizer()
porter_stemmer = PorterStemmer()
word = sent[i][0]
postag = sent[i][1]
features = {
'bias': 1.0,
'word.lower()': word.lower(),
'word.isdigit()': word.isdigit(),
# Check if its punctuations
'word.punct()': word in string.punctuation,
'postag': postag,
# Lemma of the word
'word.lemma()': wordnet_lemmatizer.lemmatize(word),
# Stem of the word
'word.stem()': porter_stemmer.stem(word)
}
if i > 0:
word1 = sent[i-1][0]
postag1 = sent[i-1][1]
features.update({
'-1:word.lower()': word1.lower(),
'-1:word.isdigit()': word1.isdigit(),
'-1:word.punct()': word1 in string.punctuation,
'-1:postag': postag1
})
if i - 2 >= 0:
features.update({
'-2:word.lower()': sent[i-2][0].lower(),
'-2:word.isdigit()': sent[i-2][0].isdigit(),
'-2:word.punct()': sent[i-2][0] in string.punctuation,
'-2:postag': sent[i-2][1]
})
if i - 3 >= 0:
features.update({
'-3:word.lower()': sent[i-3][0].lower(),
'-3:word.isdigit()': sent[i-3][0].isdigit(),
'-3:word.punct()': sent[i-3][0] in string.punctuation,
'-3:postag': sent[i-3][1]
})
if i - 4 >= 0:
features.update({
'-4:word.lower()': sent[i-4][0].lower(),
'-4:word.isdigit()': sent[i-4][0].isdigit(),
'-4:word.punct()': sent[i-4][0] in string.punctuation,
'-4:postag': sent[i-4][1]
})
else:
features['BOS'] = True
if i < len(sent)-1:
word1 = sent[i+1][0]
postag1 = sent[i+1][1]
features.update({
'+1:word.lower()': word1.lower(),
'+1:word.isdigit()': word1.isdigit(),
'+1:word.punct()': word1 in string.punctuation,
'+1:postag': postag1
})
if i + 2 < len(sent):
features.update({
'+2:word.lower()': sent[i+2][0].lower(),
'+2:word.isdigit()': sent[i+2][0].isdigit(),
'+2:word.punct()': sent[i+2][0] in string.punctuation,
'+2:postag': sent[i+2][1]
})
if i + 3 < len(sent):
features.update({
'+3:word.lower()': sent[i+3][0].lower(),
'+3:word.isdigit()': sent[i+3][0].isdigit(),
'+3:word.punct()': sent[i+3][0] in string.punctuation,
'+3:postag': sent[i+3][1]
})
if i + 4 < len(sent):
features.update({
'+4:word.lower()': sent[i+4][0].lower(),
'+4:word.isdigit()': sent[i+4][0].isdigit(),
'+4:word.punct()': sent[i+4][0] in string.punctuation,
'+4:postag': sent[i+4][1]
})
else:
features['EOS'] = True
return features
```
### Obtain features for a given sentence
```
def sent2features(sent):
'''
This method is used to extract features from the sentence.
'''
return [word2features(sent, i) for i in range(len(sent))]
```
### Load file from your directory
```
df_eval = pd.read_excel("testset_NER_LegalLens.xlsx")
```
### Parse the stringified token lists and create POS tags for each token
```
df_eval["tokens"] = df_eval["tokens"].apply(ast.literal_eval)
df_eval['pos_tags'] = df_eval['tokens'].apply(lambda x: [tag[1]
for tag in pos_tag(x)])
```
### Aggregate tokens into sentences and build the evaluation features
```
data_eval = []
for i in range(len(df_eval)):
for j in range(len(df_eval["tokens"][i])):
data_eval.append(
{
"sentence_num": i+1,
"id": df_eval["id"][i],
"token": df_eval["tokens"][i][j],
"pos_tag": df_eval["pos_tags"][i][j],
}
)
data_eval = pd.DataFrame(data_eval)
getter = getsentence(data_eval)
sentences_eval = getter.sentences
X_eval = [sent2features(s) for s in sentences_eval]
```
### Load the model from your directory and generate predictions
```
crf = joblib.load("../models/crf.pkl")
y_pred_eval = crf.predict(X_eval)
print("NER tags predicted.")
df_eval["ner_tags"] = y_pred_eval
df_eval.drop(columns=["pos_tags"], inplace=True)
print("Saving the predictions...")
df_eval.to_csv("predictions_NERLens.csv", index=False)
print("Predictions saved.")
```
## Training Details
### Training Data
<!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
[darrow-ai/LegalLensNER](https://huggingface.co/datasets/darrow-ai/LegalLensNER)
### Training Procedure
<!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
The token lists in the dataset were first parsed, and POS tags were created for each token in the text. The CRF was then trained on the handcrafted features
on a CPU; training takes around 20-30 minutes for this dataset.
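The exact training script is not reproduced here; below is a minimal sketch of how such a model can be trained, assuming the `sklearn-crfsuite` package and the `sent2features` feature extraction shown above (the hyperparameter values are illustrative, not necessarily those used for the released model):
```
import joblib
import sklearn_crfsuite  # assumption: the CRF is an sklearn-crfsuite model

# X_train: list of sentences, each a list of feature dicts from sent2features()
# y_train: list of sentences, each a list of BIO tags (e.g. "B-LAW", "I-LAW", "O")
crf = sklearn_crfsuite.CRF(
    algorithm="lbfgs",              # L-BFGS optimisation (illustrative choice)
    c1=0.1,                         # L1 regularisation strength (illustrative)
    c2=0.1,                         # L2 regularisation strength (illustrative)
    max_iterations=100,
    all_possible_transitions=True,
)
crf.fit(X_train, y_train)
joblib.dump(crf, "crf.pkl")         # saved model loaded by the evaluation code above
```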
#### Preprocessing [optional]
POS tags were assigned to every token using the NLTK library.
#### Training Hyperparameters
- **Training regime:** NA <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->
#### Speeds, Sizes, Times [optional]
<!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
NA
## Evaluation
<!-- This section describes the evaluation protocols and provides the results. -->
The model was evaluated using the macro-F1 score. A score of 0.32 was obtained on unseen test data.
### Testing Data, Factors & Metrics
#### Testing Data
<!-- This should link to a Dataset Card if possible. -->
[darrow-ai/LegalLensNER](https://huggingface.co/datasets/darrow-ai/LegalLensNER)
#### Factors
<!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
[More Information Needed]
#### Metrics
<!-- These are the evaluation metrics being used, ideally with a description of why. -->
Macro-F1 score, as it reflects performance across all entity classes equally and mitigates the score inflation caused by the highly skewed entity distribution in the dataset.
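As a sketch, the macro-F1 can be computed from the per-sentence tag lists by flattening them and scoring with scikit-learn (assuming gold tags `y_true` are available in the same nested-list format as the predictions `y_pred`):
```
from itertools import chain
from sklearn.metrics import f1_score

# y_true / y_pred: lists of sentences, each a list of BIO tags.
y_true_flat = list(chain.from_iterable(y_true))
y_pred_flat = list(chain.from_iterable(y_pred))

# Macro averaging weights every label equally, so rare entity types count
# as much as frequent ones.
macro_f1 = f1_score(y_true_flat, y_pred_flat, average="macro")
print(f"Macro-F1: {macro_f1:.2f}")
```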
### Results
0.32 macro-F1 score on unseen data.
#### Summary
The model was designed and developed to tackle the NER task in unstructured legal text.
## Model Examination [optional]
<!-- Relevant interpretability work for the model goes here -->
NA
## Environmental Impact
<!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->
Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
- **Hardware Type:** 13th Gen Intel(R) Core(TM) i7-1365U
- **Hours used:** 0.5 hours
- **Cloud Provider:** NA
- **Compute Region:** NA
- **Carbon Emitted:** Unknown
## Technical Specifications [optional]
### Model Architecture and Objective
[More Information Needed]
### Compute Infrastructure
[More Information Needed]
#### Hardware
[More Information Needed]
#### Software
[More Information Needed]
## Citation [optional]
<!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
**BibTeX:**
[More Information Needed]
**APA:**
[More Information Needed]
## Glossary [optional]
<!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->
[More Information Needed]
## More Information [optional]
[More Information Needed]
## Model Card Authors [optional]
[More Information Needed]
## Model Card Contact
[More Information Needed]