law-ai committed on
Commit 2ab1999
1 Parent(s): c1a4bbe

Update README.md

Files changed (1): README.md (+37 -0)

README.md CHANGED
---
language: en
pipeline_tag: fill-mask
tags:
- legal
license: mit
---
### InCaseLawBERT
Model and tokenizer files for the InCaseLawBERT model.

### Training Data
For building the pre-training corpus of Indian legal text, we collected a large corpus of case documents from the Indian Supreme Court and many High Courts of India.
These documents were collected from diverse publicly available sources on the Web, such as the official websites of these courts (e.g., [the website of the Indian Supreme Court](https://main.sci.gov.in/)), the erstwhile website of the Legal Information Institute of India, and the popular legal repository [IndianKanoon](https://www.indiankanoon.org).
The court cases in our dataset range from 1950 to 2019 and belong to all legal domains, such as Civil, Criminal, and Constitutional.
Additionally, we collected 1,113 Central Government Acts, which are the documents codifying the laws of the country. Each Act is a collection of related laws, called Sections. These 1,113 Acts contain a total of 32,021 Sections.
In total, our dataset contains around 5.4 million Indian legal documents (all in the English language).
The raw text corpus size is around 27 GB.

### Training Objective
This model is initialized with the [Legal-BERT model](https://huggingface.co/zlucia/legalbert) from the paper [When does pretraining help? Assessing self-supervised learning for law and the CaseHOLD dataset of 53,000+ legal holdings](https://dl.acm.org/doi/abs/10.1145/3462757.3466088). In our work, we refer to this model as CaseLawBERT, and to our re-trained model as InCaseLawBERT. The re-training uses the standard BERT pre-training objectives, masked language modeling (MLM) and next sentence prediction (NSP) (see the Usage section below).

### Usage
Using the tokenizer (same as [CaseLawBERT](https://huggingface.co/zlucia/legalbert)):
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("law-ai/InCaseLawBERT")
```
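For example, a quick check of the tokenization on a sample sentence (the text below is an illustrative placeholder of ours, not taken from the corpus):
```python
# Inspect the word pieces the tokenizer produces for an example sentence.
text = "The appellant filed a writ petition before the High Court."
tokens = tokenizer.tokenize(text)
print(tokens)

encoded = tokenizer(text, return_tensors="pt")  # tensors ready for the model
```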
Using the model to get embeddings/representations for a sentence:
```python
from transformers import AutoModel

model = AutoModel.from_pretrained("law-ai/InCaseLawBERT")
```
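A minimal sketch of extracting the representations, assuming the tokenizer loaded above and an illustrative sentence of our own choosing:
```python
import torch

# Encode a sample sentence and run a forward pass without gradient tracking.
text = "The appellant filed a writ petition before the High Court."
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Token-level representations, shape (1, sequence_length, hidden_size).
token_embeddings = outputs.last_hidden_state
# The [CLS] vector is a common, if crude, single-vector sentence representation.
sentence_embedding = token_embeddings[:, 0, :]
print(token_embeddings.shape, sentence_embedding.shape)
```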
Using the model for further pre-training with MLM and NSP:
```python
from transformers import BertForPreTraining

model_with_pretraining_heads = BertForPreTraining.from_pretrained("law-ai/InCaseLawBERT")
```
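Since the model card's pipeline tag is fill-mask, the MLM head can also be sanity-checked through the pipeline API; this is a sketch with an example sentence of our own, not an evaluation:
```python
from transformers import pipeline

# Load a fill-mask pipeline backed by the model's MLM head.
fill_mask = pipeline("fill-mask", model="law-ai/InCaseLawBERT")

# [MASK] is the standard BERT mask token; the sentence is illustrative only.
for prediction in fill_mask("The court dismissed the [MASK] petition."):
    print(prediction["token_str"], round(prediction["score"], 4))
```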

### Citation