---
license: apache-2.0
datasets:
- darrow-ai/LegalLensNER
language:
- en
metrics:
- f1
pipeline_tag: token-classification
library_name: sklearn
tags:
- ner
- legal
- crf
---
# Model Card for a CRF-Based Legal NER Model
<!-- Provide a quick summary of what the model is/does. -->
Conditional Random Field (CRF) model for named entity recognition with handcrafted features. Named entities recognized: Violation-on, Violation-by, and Law.
The dataset uses the BIO tagging format. The model achieves a macro-F1 score of 0.32.
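For illustration, a hypothetical BIO-tagged sentence (the tokens and tags below are made up, not taken from the dataset) looks like this:
```
# Hypothetical example of the BIO scheme: B-* marks the first token of an
# entity, I-* marks its continuation, and O marks tokens outside any entity.
tokens = ["The", "company", "violated", "the", "Clean", "Air", "Act", "."]
bio_tags = ["O", "O", "O", "O", "B-LAW", "I-LAW", "I-LAW", "O"]
```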
## Model Details
### Model Description
<!-- Provide a longer summary of what this model is. -->
The model was developed for the LegalLens 2024 shared task, held as part of the Natural Legal Language Processing (NLLP) 2024 workshop. It uses handcrafted features to identify named
entities in the BIO format.
- **Developed by:** Shashank M Chakravarthy
- **Funded by [optional]:** NA
- **Shared by [optional]:** NA
- **Model type:** Statistical Model
- **Language(s) (NLP):** English
- **License:** Apache 2.0 License
- **Finetuned from model [optional]:** NA
### Model Sources [optional]
<!-- Provide the basic links for the model. -->
- **Repository:** NA
- **Paper [optional]:** [ACL Anthology 2024.nllp-1.33](https://aclanthology.org/2024.nllp-1.33.pdf)
- **Demo [optional]:** NA
## Uses
<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
The model detects named entities in unstructured legal text. It can be extended to other entity types by further modifying the handcrafted features.
### Direct Use
<!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
The model can be used directly on any unstructured text after a small amount of preprocessing (tokenization and POS tagging). The files in this repository include the evaluation script.
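A minimal sketch of that preprocessing for a single raw sentence (an assumption-laden illustration: it relies on NLTK's tokenizer and POS tagger, on the `sent2features` function defined in the how-to section below, and on the pickled CRF model file; the example text and file path are placeholders):
```
import joblib
import nltk
from nltk import pos_tag
from nltk.tokenize import word_tokenize

nltk.download("punkt")                       # tokenizer models (assumed not yet downloaded)
nltk.download("averaged_perceptron_tagger")  # POS tagger models

text = "The company was accused of violating the Clean Air Act."  # placeholder input
tokens = word_tokenize(text)                 # split raw text into tokens
tagged = pos_tag(tokens)                     # list of (token, POS tag) pairs

# sent2features() is defined later in this card; crf.pkl is the trained model file.
crf = joblib.load("crf.pkl")
features = sent2features(tagged)
ner_tags = crf.predict([features])[0]        # one list of BIO tags per sentence
print(list(zip(tokens, ner_tags)))
```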
### Downstream Use [optional]
<!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
### Out-of-Scope Use
<!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
This model is handcrafted for detecting violations and laws in text. It can be applied to other legal text containing similar entities, but uses beyond that are out of scope.
## Bias, Risks, and Limitations
<!-- This section is meant to convey both technical and sociotechnical limitations. -->
The main limitation comes from the handcrafted features: the model can only capture patterns that those features encode, so it may not generalize to text that differs substantially from the training data.
### Recommendations
<!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
If the text used for prediction is not properly preprocessed with POS tags, the model will not perform as designed.
## How to Get Started with the Model
Use the code below to get started with the model.
### Load libraries
```
import ast
import pandas as pd
import joblib
import nltk
from nltk import pos_tag
import string
from nltk.stem import WordNetLemmatizer
from nltk.stem import PorterStemmer
```
### Download the required NLTK resources (if not already present)
```
nltk.download('wordnet')
nltk.download('omw-1.4')
nltk.download("averaged_perceptron_tagger")
```
### Class for grouping tokens into sentences (redundant if the text is processed directly)
```
class getsentence(object):
'''
This class is used to get the sentences from the dataset.
Converts from BIO format to sentences using their sentence numbers
'''
def __init__(self, data):
self.n_sent = 1.0
self.data = data
self.empty = False
self.grouped = self.data.groupby("sentence_num").apply(self._agg_func)
self.sentences = [s for s in self.grouped]
def _agg_func(self, s):
return [(w, p) for w, p in zip(s["token"].values.tolist(),
s["pos_tag"].values.tolist())]
```
### Create features for the words in a sentence (the code could be shortened by iterating over offsets)
```
def word2features(sent, i):
'''
This method is used to extract features from the words in the sentence.
The main features extracted are:
- word.lower(): The word in lowercase
- word.isdigit(): If the word is a digit
- word.punct(): If the word is a punctuation
- postag: The pos tag of the word
- word.lemma(): The lemma of the word
- word.stem(): The stem of the word
The features (not all) are also extracted for the 4 previous and 4 next words.
'''
wordnet_lemmatizer = WordNetLemmatizer()
porter_stemmer = PorterStemmer()
word = sent[i][0]
postag = sent[i][1]
features = {
'bias': 1.0,
'word.lower()': word.lower(),
'word.isdigit()': word.isdigit(),
# Check if its punctuations
'word.punct()': word in string.punctuation,
'postag': postag,
# Lemma of the word
'word.lemma()': wordnet_lemmatizer.lemmatize(word),
# Stem of the word
'word.stem()': porter_stemmer.stem(word)
}
if i > 0:
word1 = sent[i-1][0]
postag1 = sent[i-1][1]
features.update({
'-1:word.lower()': word1.lower(),
'-1:word.isdigit()': word1.isdigit(),
'-1:word.punct()': word1 in string.punctuation,
'-1:postag': postag1
})
if i - 2 >= 0:
features.update({
'-2:word.lower()': sent[i-2][0].lower(),
'-2:word.isdigit()': sent[i-2][0].isdigit(),
'-2:word.punct()': sent[i-2][0] in string.punctuation,
'-2:postag': sent[i-2][1]
})
if i - 3 >= 0:
features.update({
'-3:word.lower()': sent[i-3][0].lower(),
'-3:word.isdigit()': sent[i-3][0].isdigit(),
'-3:word.punct()': sent[i-3][0] in string.punctuation,
'-3:postag': sent[i-3][1]
})
if i - 4 >= 0:
features.update({
'-4:word.lower()': sent[i-4][0].lower(),
'-4:word.isdigit()': sent[i-4][0].isdigit(),
'-4:word.punct()': sent[i-4][0] in string.punctuation,
'-4:postag': sent[i-4][1]
})
else:
features['BOS'] = True
if i < len(sent)-1:
word1 = sent[i+1][0]
postag1 = sent[i+1][1]
features.update({
'+1:word.lower()': word1.lower(),
'+1:word.isdigit()': word1.isdigit(),
'+1:word.punct()': word1 in string.punctuation,
'+1:postag': postag1
})
if i + 2 < len(sent):
features.update({
'+2:word.lower()': sent[i+2][0].lower(),
'+2:word.isdigit()': sent[i+2][0].isdigit(),
'+2:word.punct()': sent[i+2][0] in string.punctuation,
'+2:postag': sent[i+2][1]
})
if i + 3 < len(sent):
features.update({
'+3:word.lower()': sent[i+3][0].lower(),
'+3:word.isdigit()': sent[i+3][0].isdigit(),
'+3:word.punct()': sent[i+3][0] in string.punctuation,
'+3:postag': sent[i+3][1]
})
if i + 4 < len(sent):
features.update({
'+4:word.lower()': sent[i+4][0].lower(),
'+4:word.isdigit()': sent[i+4][0].isdigit(),
'+4:word.punct()': sent[i+4][0] in string.punctuation,
'+4:postag': sent[i+4][1]
})
else:
features['EOS'] = True
return features
```
### Obtain features for a given sentence
```
def sent2features(sent):
'''
This method is used to extract features from the sentence.
'''
return [word2features(sent, i) for i in range(len(sent))]
```
### Load file from your directory
```
df_eval = pd.read_excel("testset_NER_LegalLens.xlsx")
```
### Parse the stringified token lists and create POS tags for each token
```
df_eval["tokens"] = df_eval["tokens"].apply(ast.literal_eval)
df_eval['pos_tags'] = df_eval['tokens'].apply(lambda x: [tag[1]
for tag in pos_tag(x)])
```
### Aggregate tokens into sentences and build the evaluation features
```
data_eval = []
for i in range(len(df_eval)):
for j in range(len(df_eval["tokens"][i])):
data_eval.append(
{
"sentence_num": i+1,
"id": df_eval["id"][i],
"token": df_eval["tokens"][i][j],
"pos_tag": df_eval["pos_tags"][i][j],
}
)
data_eval = pd.DataFrame(data_eval)
getter = getsentence(data_eval)
sentences_eval = getter.sentences
X_eval = [sent2features(s) for s in sentences_eval]
```
### Load the model from your directory and generate predictions
```
crf = joblib.load("../models/crf.pkl")
y_pred_eval = crf.predict(X_eval)
print("NER tags predicted.")
df_eval["ner_tags"] = y_pred_eval
df_eval.drop(columns=["pos_tags"], inplace=True)
print("Saving the predictions...")
df_eval.to_csv("predictions_NERLens.csv", index=False)
print("Predictions saved.")
```
## Training Details
### Training Data
<!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
[darrow-ai/LegalLensNER](https://huggingface.co/datasets/darrow-ai/LegalLensNER)
### Training Procedure
<!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
The token lists in the dataset were first parsed, and POS tags were created for each token in the text. The CRF was then trained on the handcrafted features
on a CPU; training takes around 20-30 minutes for this dataset.
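The exact training script is not reproduced here; below is a minimal sketch of how such a model can be trained, assuming the `sklearn-crfsuite` package and the `sent2features` feature extraction shown above (the hyperparameter values are illustrative, not necessarily those used for the released model):
```
import joblib
import sklearn_crfsuite  # assumption: the CRF is an sklearn-crfsuite model

# X_train: list of sentences, each a list of feature dicts from sent2features()
# y_train: list of sentences, each a list of BIO tags (e.g. "B-LAW", "I-LAW", "O")
crf = sklearn_crfsuite.CRF(
    algorithm="lbfgs",              # L-BFGS optimisation (illustrative choice)
    c1=0.1,                         # L1 regularisation strength (illustrative)
    c2=0.1,                         # L2 regularisation strength (illustrative)
    max_iterations=100,
    all_possible_transitions=True,
)
crf.fit(X_train, y_train)
joblib.dump(crf, "crf.pkl")         # saved model loaded by the evaluation code above
```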
#### Preprocessing [optional]
POS tags were assigned to every token using the NLTK library.
#### Training Hyperparameters
- **Training regime:** NA <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->
#### Speeds, Sizes, Times [optional]
<!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
NA
## Evaluation
<!-- This section describes the evaluation protocols and provides the results. -->
The model was evaluated using the macro-F1 score. A score of 0.32 was obtained on unseen test data.
### Testing Data, Factors & Metrics
#### Testing Data
<!-- This should link to a Dataset Card if possible. -->
[darrow-ai/LegalLensNER](https://huggingface.co/datasets/darrow-ai/LegalLensNER)
#### Factors
<!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
[More Information Needed]
#### Metrics
<!-- These are the evaluation metrics being used, ideally with a description of why. -->
Macro-F1 score, as it reflects performance across all entity classes equally and mitigates the score inflation caused by the highly skewed entity distribution in the dataset.
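As a sketch, the macro-F1 can be computed from the per-sentence tag lists by flattening them and scoring with scikit-learn (assuming gold tags `y_true` are available in the same nested-list format as the predictions `y_pred`):
```
from itertools import chain
from sklearn.metrics import f1_score

# y_true / y_pred: lists of sentences, each a list of BIO tags.
y_true_flat = list(chain.from_iterable(y_true))
y_pred_flat = list(chain.from_iterable(y_pred))

# Macro averaging weights every label equally, so rare entity types count
# as much as frequent ones.
macro_f1 = f1_score(y_true_flat, y_pred_flat, average="macro")
print(f"Macro-F1: {macro_f1:.2f}")
```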
### Results
0.32 macro-F1 score on unseen data.
#### Summary
The model was designed and developed to tackle the NER task in unstructured legal text.
## Model Examination [optional]
<!-- Relevant interpretability work for the model goes here -->
NA
## Environmental Impact
<!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->
Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
- **Hardware Type:** 13th Gen Intel(R) Core(TM) i7-1365U
- **Hours used:** 0.5 hours
- **Cloud Provider:** NA
- **Compute Region:** NA
- **Carbon Emitted:** Unknown
## Technical Specifications [optional]
### Model Architecture and Objective
[More Information Needed]
### Compute Infrastructure
[More Information Needed]
#### Hardware
[More Information Needed]
#### Software
[More Information Needed]
## Citation [optional]
<!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
**BibTeX:**
[More Information Needed]
**APA:**
[More Information Needed]
## Glossary [optional]
<!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->
[More Information Needed]
## More Information [optional]
[More Information Needed]
## Model Card Authors [optional]
[More Information Needed]
## Model Card Contact
[More Information Needed]