---
license: apache-2.0
datasets:
- darrow-ai/LegalLensNER
language:
- en
metrics:
- f1
pipeline_tag: token-classification
library_name: sklearn
tags:
- ner
- legal
- crf
---

# Model Card for Model ID

A Conditional Random Field (CRF) model for named entity recognition with handcrafted features. It recognizes three entity types: Violation-on, Violation-by, and Law. The dataset is in the BIO format. The model achieves a macro-F1 score of 0.32.

## Model Details

### Model Description

The model was developed for the LegalLens 2024 competition as part of Natural Legal Language Processing 2024. It uses handcrafted features to identify named entities in the BIO format.

- **Developed by:** Shashank M Chakravarthy
- **Funded by [optional]:** NA
- **Shared by [optional]:** NA
- **Model type:** Statistical Model
- **Language(s) (NLP):** English
- **License:** Apache 2.0 License
- **Finetuned from model [optional]:** NA

### Model Sources [optional]

- **Repository:** NA
- **Paper [optional]:** [https://aclanthology.org/2024.nllp-1.33.pdf]
- **Demo [optional]:** NA

## Uses

The model detects named entities in unstructured text. It can be extended to other entity types by further modifying the handcrafted features.

### Direct Use

The model can be used directly on any unstructured text after light preprocessing (tokenization and POS tagging). The files include the evaluation script.

### Downstream Use [optional]

### Out-of-Scope Use

The features are handcrafted for detecting violations and laws in text. The model may also be applied to other legal text that contains similar entities.

## Bias, Risks, and Limitations

The main limitation comes from the handcrafted features.

### Recommendations

If the text used for prediction is not preprocessed with POS tags, the model will not perform as designed.

## How to Get Started with the Model

Use the code below to get started with the model.
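Because the model emits one BIO tag per token, it can help to see how such tags map back to entity spans before running the full script. The following is a minimal sketch and not part of the released files: the helper name `bio_to_spans`, the example sentence, and tag strings such as `B-LAW` are hypothetical and may differ from the labels stored in the dataset.

```python
def bio_to_spans(tokens, tags):
    """Group BIO tags into (entity_type, start, end_exclusive, text) spans.

    Hypothetical helper for illustration; tag names are assumptions.
    """
    spans = []
    start, current = None, None
    for i, tag in enumerate(tags):
        # The open span continues only on an "I-" tag of the same type.
        continues = tag.startswith("I-") and current == tag[2:]
        if current is not None and not continues:
            spans.append((current, start, i, " ".join(tokens[start:i])))
            start, current = None, None
        if tag.startswith("B-") or (tag.startswith("I-") and current is None):
            # A stray "I-" with no open span is treated as a span start.
            start, current = i, tag[2:]
    if current is not None:
        spans.append((current, start, len(tags), " ".join(tokens[start:])))
    return spans


# Made-up sentence and tags, only to show the mapping.
tokens = ["The", "company", "violated", "the", "Clean", "Air", "Act", "."]
tags = ["O", "O", "O", "O", "B-LAW", "I-LAW", "I-LAW", "O"]
print(bio_to_spans(tokens, tags))  # [('LAW', 4, 7, 'Clean Air Act')]
```

The same helper could be applied row by row to the predicted tags once the script below has produced them.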

### Load libraries

```
import ast
import pandas as pd
import joblib
import nltk
from nltk import pos_tag
import string
from nltk.stem import WordNetLemmatizer
from nltk.stem import PorterStemmer
```

### Check if nltk modules are downloaded, if not download them

```
nltk.download('wordnet')
nltk.download('omw-1.4')
# Newer NLTK releases may look for 'averaged_perceptron_tagger_eng' instead;
# download that resource if pos_tag raises a LookupError.
nltk.download("averaged_perceptron_tagger")
```

### Class for grouping tokens as sentences (redundant if text processed directly)

```
class getsentence(object):
    '''
    This class is used to get the sentences from the dataset.
    Converts from BIO format to sentences using their sentence numbers
    '''
    def __init__(self, data):
        self.n_sent = 1.0
        self.data = data
        self.empty = False
        # One list of (token, pos_tag) pairs per sentence number
        self.grouped = self.data.groupby("sentence_num").apply(self._agg_func)
        self.sentences = [s for s in self.grouped]

    def _agg_func(self, s):
        return [(w, p) for w, p in zip(s["token"].values.tolist(),
                                       s["pos_tag"].values.tolist())]
```

### Creates features for words in a sentence (code can be reduced using iteration)

```
def word2features(sent, i):
    '''
    This method is used to extract features from the words in the sentence.
    The main features extracted are:
    - word.lower(): The word in lowercase
    - word.isdigit(): If the word is a digit
    - word.punct(): If the word is a punctuation
    - postag: The pos tag of the word
    - word.lemma(): The lemma of the word
    - word.stem(): The stem of the word
    The features (not all) are also extracted for the 4 previous and 4 next words.
    '''
    wordnet_lemmatizer = WordNetLemmatizer()
    porter_stemmer = PorterStemmer()
    word = sent[i][0]
    postag = sent[i][1]

    features = {
        'bias': 1.0,
        'word.lower()': word.lower(),
        'word.isdigit()': word.isdigit(),
        # Check if its punctuations
        'word.punct()': word in string.punctuation,
        'postag': postag,
        # Lemma of the word
        'word.lemma()': wordnet_lemmatizer.lemmatize(word),
        # Stem of the word
        'word.stem()': porter_stemmer.stem(word)
    }
    if i > 0:
        word1 = sent[i-1][0]
        postag1 = sent[i-1][1]
        features.update({
            '-1:word.lower()': word1.lower(),
            '-1:word.isdigit()': word1.isdigit(),
            '-1:word.punct()': word1 in string.punctuation,
            '-1:postag': postag1
        })
        if i - 2 >= 0:
            features.update({
                '-2:word.lower()': sent[i-2][0].lower(),
                '-2:word.isdigit()': sent[i-2][0].isdigit(),
                '-2:word.punct()': sent[i-2][0] in string.punctuation,
                '-2:postag': sent[i-2][1]
            })
        if i - 3 >= 0:
            features.update({
                '-3:word.lower()': sent[i-3][0].lower(),
                '-3:word.isdigit()': sent[i-3][0].isdigit(),
                '-3:word.punct()': sent[i-3][0] in string.punctuation,
                '-3:postag': sent[i-3][1]
            })
        if i - 4 >= 0:
            features.update({
                '-4:word.lower()': sent[i-4][0].lower(),
                '-4:word.isdigit()': sent[i-4][0].isdigit(),
                '-4:word.punct()': sent[i-4][0] in string.punctuation,
                '-4:postag': sent[i-4][1]
            })
    else:
        # First token of the sentence
        features['BOS'] = True

    if i < len(sent)-1:
        word1 = sent[i+1][0]
        postag1 = sent[i+1][1]
        features.update({
            '+1:word.lower()': word1.lower(),
            '+1:word.isdigit()': word1.isdigit(),
            '+1:word.punct()': word1 in string.punctuation,
            '+1:postag': postag1
        })
        if i + 2 < len(sent):
            features.update({
                '+2:word.lower()': sent[i+2][0].lower(),
                '+2:word.isdigit()': sent[i+2][0].isdigit(),
                '+2:word.punct()': sent[i+2][0] in string.punctuation,
                '+2:postag': sent[i+2][1]
            })
        if i + 3 < len(sent):
            features.update({
                '+3:word.lower()': sent[i+3][0].lower(),
                '+3:word.isdigit()': sent[i+3][0].isdigit(),
                '+3:word.punct()': sent[i+3][0] in string.punctuation,
                '+3:postag': sent[i+3][1]
            })
        if i + 4 < len(sent):
            features.update({
                '+4:word.lower()': sent[i+4][0].lower(),
                '+4:word.isdigit()': sent[i+4][0].isdigit(),
                '+4:word.punct()': sent[i+4][0] in string.punctuation,
                '+4:postag': sent[i+4][1]
            })
    else:
        # Last token of the sentence
        features['EOS'] = True

    return features
```

### Obtain features for a given sentence

```
def sent2features(sent):
    '''
    This method is used to extract features from the sentence.
    '''
    return [word2features(sent, i) for i in range(len(sent))]
```

### Load file from your directory

```
df_eval = pd.read_excel("testset_NER_LegalLens.xlsx")
```

### Evaluate data type and create pos_tags for each token

```
# The token lists are stored as strings; parse them back into Python lists
df_eval["tokens"] = df_eval["tokens"].apply(ast.literal_eval)
df_eval['pos_tags'] = df_eval['tokens'].apply(lambda x: [tag[1] for tag in pos_tag(x)])
```

### Aggregate tokens to sentences

```
data_eval = []
for i in range(len(df_eval)):
    for j in range(len(df_eval["tokens"][i])):
        data_eval.append(
            {
                "sentence_num": i+1,
                "id": df_eval["id"][i],
                "token": df_eval["tokens"][i][j],
                "pos_tag": df_eval["pos_tags"][i][j],
            }
        )
data_eval = pd.DataFrame(data_eval)

getter = getsentence(data_eval)
sentences_eval = getter.sentences
X_eval = [sent2features(s) for s in sentences_eval]
```

### Load model from your directory

```
crf = joblib.load("../models/crf.pkl")
y_pred_eval = crf.predict(X_eval)
print("NER tags predicted.")

df_eval["ner_tags"] = y_pred_eval
df_eval.drop(columns=["pos_tags"], inplace=True)

print("Saving the predictions...")
df_eval.to_csv("predictions_NERLens.csv", index=False)
print("Predictions saved.")
```

## Training Details

### Training Data

[https://huggingface.co/datasets/darrow-ai/LegalLensNER]

### Training Procedure

The token column was first converted to its proper data type, then POS tags were created for each token in the text. The model was trained on a CPU using the handcrafted features. Training takes around 20-30 minutes on this dataset.

#### Preprocessing [optional]

Every token was assigned a POS tag using the NLTK library.
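
The neighbor features built by hand in `word2features` follow a regular pattern, so the note that the code "can be reduced using iteration" can be illustrated with a loop. The sketch below is an assumption about how that might look, not the code used to train the model: the names `token_features` and `context_features` and the `window` parameter are hypothetical, and the current-token features (lemma, stem, bias) are left out so the example needs only the standard library.

```python
import string


def token_features(word, postag, prefix=""):
    """Per-token features, keyed with an optional offset prefix such as '-2:'."""
    return {
        f"{prefix}word.lower()": word.lower(),
        f"{prefix}word.isdigit()": word.isdigit(),
        f"{prefix}word.punct()": word in string.punctuation,
        f"{prefix}postag": postag,
    }


def context_features(sent, i, window=4):
    """Features of up to `window` previous and next tokens around position i.

    `sent` is a list of (token, pos_tag) pairs, as produced by the
    sentence-grouping step in the script.
    """
    features = {}
    for offset in range(1, window + 1):
        if i - offset >= 0:
            word, postag = sent[i - offset]
            features.update(token_features(word, postag, f"-{offset}:"))
        if i + offset < len(sent):
            word, postag = sent[i + offset]
            features.update(token_features(word, postag, f"+{offset}:"))
    if i == 0:
        features["BOS"] = True  # first token of the sentence
    if i == len(sent) - 1:
        features["EOS"] = True  # last token of the sentence
    return features


# Made-up sentence with illustrative POS tags.
sent = [("Acme", "NNP"), ("violated", "VBD"), ("the", "DT"), ("Act", "NNP"), (".", ".")]
first = context_features(sent, 0)
last = context_features(sent, len(sent) - 1)
```

The keys match the ones produced by the explicit version, so a dictionary built this way could be merged with the current-token features before being passed to the CRF.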

#### Training Hyperparameters

- **Training regime:** NA

#### Speeds, Sizes, Times [optional]

NA

## Evaluation

The model was evaluated using the macro-F1 score. A score of 0.32 was obtained on unseen test data.

### Testing Data, Factors & Metrics

#### Testing Data

[https://huggingface.co/datasets/darrow-ai/LegalLensNER]

#### Factors

[More Information Needed]

#### Metrics

Macro-F1 score, because it weights every entity class equally and therefore is not inflated by the highly skewed entity classes in the dataset.

### Results

0.32 macro-F1 score on unseen data.

#### Summary

The model was designed and developed to tackle the NER task in unstructured text.

## Model Examination [optional]

NA

## Environmental Impact

Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).

- **Hardware Type:** 13th Gen Intel(R) Core(TM) i7-1365U
- **Hours used:** 0.5 hours
- **Cloud Provider:** NA
- **Compute Region:** NA
- **Carbon Emitted:** Unknown

## Technical Specifications [optional]

### Model Architecture and Objective

[More Information Needed]

### Compute Infrastructure

[More Information Needed]

#### Hardware

[More Information Needed]

#### Software

[More Information Needed]

## Citation [optional]

**BibTeX:**

[More Information Needed]

**APA:**

[More Information Needed]

## Glossary [optional]

[More Information Needed]

## More Information [optional]

[More Information Needed]

## Model Card Authors [optional]

[More Information Needed]

## Model Card Contact

[More Information Needed]