|
--- |
|
library_name: transformers |
|
license: mit |
|
datasets: |
|
- chenghao/sec-material-contracts-qa-splitted |
|
- chenghao/sec-material-contracts-qa |
|
- jordyvl/DUDE_subset_100val |
|
language: |
|
- en |
|
pipeline_tag: document-question-answering |
|
--- |
|
|
|
# Idefics2-EDGAR
|
|
|
Idefics2-8B fine-tuned on 800+ multi-page documents for visual document question answering (DocQA). Make sure you have recent versions of `peft` and `transformers` installed before loading the model; a GPU is required for it to work properly.
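For example (exact version pins are up to you; `bitsandbytes` and `accelerate` are needed for the 4-bit loading shown below):

```
pip install -U transformers peft bitsandbytes accelerate
```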
|
|
|
Compared to the base model, it achieves a lower edit distance on the test set (a 53% improvement on the micro average). The table below reports edit distance per question category (lower is better) together with the relative improvement (Δ).
|
|
|
| Category                    | Idefics2-8B | Idefics2-8B-EDGAR | Δ (improvement) |
|:----------------------------|------------:|------------------:|----------------:|
| agreement_date              |    0.878489 |         0.0999479 |          88.62% |
| agreement_term              |    0.907067 |          0.438816 |          51.62% |
| auto_renewal                |    0.634946 |         0.0516129 |          91.87% |
| contract_value              |    0.474438 |          0.418815 |          11.72% |
| counterparty_address        |    0.771387 |           0.59835 |          22.43% |
| counterparty_name           |    0.825491 |          0.633359 |          23.27% |
| counterparty_signer_name    |    0.842091 |          0.480444 |          42.95% |
| counterparty_signer_title   |     0.61746 |          0.496041 |          19.66% |
| effective_date              |    0.903268 |          0.125641 |          86.09% |
| expiration_date             |     0.88673 |          0.235197 |          73.48% |
| governing_law               |    0.881037 |          0.308771 |          64.95% |
| opt_out_length              |    0.431548 |          0.047619 |          88.97% |
| party_address               |    0.730897 |          0.608301 |          16.77% |
| party_name                  |    0.726411 |          0.490194 |          32.52% |
| payment_frequency           |    0.686123 |          0.373724 |          45.53% |
| payment_term                |    0.854552 |          0.593333 |          30.57% |
| renewal_term                |     0.92829 |         0.0595238 |          93.59% |
| termination_for_cause       |       0.436 |             0.048 |          88.99% |
| termination_for_convenience |    0.628261 |          0.156522 |          75.09% |
| termination_notice_period   |    0.329748 |          0.178394 |          45.90% |
| venue                       |    0.781417 |           0.61403 |          21.42% |
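The Δ column is the relative reduction in edit distance from the base model to the fine-tuned model. A minimal sketch of how a single row's Δ can be reproduced from the two edit-distance columns (the helper function below is illustrative, not part of the evaluation code):

```python
def relative_improvement(base: float, finetuned: float) -> float:
    """Relative reduction in edit distance: higher means a bigger improvement."""
    return (base - finetuned) / base

# agreement_date, using the values from the table above.
print(f"{relative_improvement(0.878489, 0.0999479):.2%}")  # 88.62%
```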
|
|
|
|
|
|
|
![image/png](https://cdn-uploads.huggingface.co/production/uploads/607a5b44489fc71534e91c0e/3Jc7I1Fj2J3rabos2HLyY.png) |
|
|
|
## Model Details |
|
|
|
### Model Description |
|
|
|
Fine-tuned from [Idefics2](https://huggingface.co/docs/transformers/main/en/model_doc/idefics2).
|
|
|
## Uses |
|
|
|
```python
import torch
from transformers import AutoProcessor, Idefics2ForConditionalGeneration, BitsAndBytesConfig
from datasets import load_from_disk

base_model = "HuggingFaceM4/idefics2-8b"
peft_model_id = "chenghao/idefics2-edgar"

# Load the adapter on top of the 4-bit quantized base model (requires a GPU).
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.float16,
)
model = Idefics2ForConditionalGeneration.from_pretrained(
    peft_model_id,
    torch_dtype=torch.float16,
    quantization_config=quantization_config,
)
model.eval()

# The processor comes from the base model; image splitting is enabled at inference time.
processor = AutoProcessor.from_pretrained(
    base_model,
    do_image_splitting=True,
    size={"longest_edge": 490, "shortest_edge": 350},
)

# Load a local copy of the dataset and pick one test example.
dataset = load_from_disk("local-dataset")
test_example = dataset["test"][30]
images, question, answer = test_example["images"], test_example["question"], test_example["answer"]

# Build a chat-style prompt with one image placeholder per page, followed by the question.
messages = [
    {
        "role": "user",
        "content": [{"type": "image"} for _ in range(len(images))] + [{"type": "text", "text": question}],
    },
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=images, return_tensors="pt").to("cuda")

with torch.no_grad():
    generated_ids = model.generate(**inputs, max_new_tokens=1024)

# Keep only the assistant's part of the decoded conversation.
generated_texts = processor.batch_decode(generated_ids, skip_special_tokens=True)
preds = [t.split("Assistant:", 1)[-1].strip() for t in generated_texts]
print(f"""
Question: {question}
Answer: {answer}
Prediction: {preds[0] if preds else 'N/A'}
""")
```
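The snippet above reads pages from a local copy of the dataset. To ask a question about your own contract, you only need a list of PIL page images and a question string; a minimal sketch using the `pdf2image` package (not a dependency of this model; it requires Poppler, and the file name is only illustrative):

```python
from pdf2image import convert_from_path

# Render the first few pages of a contract PDF as PIL images.
images = convert_from_path("my_contract.pdf", dpi=150)[:5]
question = "What is the effective date of this agreement?"
# `images` and `question` can then be fed into the `messages`/`processor` code above.
```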
|
|
|
## Training Details |
|
|
|
### Training Data |
|
|
|
[SEC Contract QA](https://huggingface.co/datasets/chenghao/sec-material-contracts-qa) |
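The dataset can be pulled directly from the Hub with the `datasets` library (split names and columns may differ from the local copy used in the usage example above):

```python
from datasets import load_dataset

ds = load_dataset("chenghao/sec-material-contracts-qa")
print(ds)
```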
|
|
|
### Training Procedure |
|
|
|
Trained for 10 epochs with QLoRA on a single A100-80GB GPU for about 10 hours.
|
|
|
```python
MAX_LENGTH = 1024
USE_LORA = False
USE_QLORA = True
MAX_PAGE = 5

config = {
    "max_epochs": 10,
    # "val_check_interval": 0.2,
    "check_val_every_n_epoch": 1,
    "gradient_clip_val": 1.0,
    "accumulate_grad_batches": 12,
    "lr": 1e-4,
    "batch_size": 2,
    "precision": "16-mixed",
    "seed": 42,
    "warmup_steps": 50,
    "result_path": "./result",
    "verbose": True,
}
```
|
|
|
#### Preprocessing
|
|
|
Image splitting was disabled during training due to memory limits:
|
|
|
```python
processor = AutoProcessor.from_pretrained(
    "HuggingFaceM4/idefics2-8b",
    do_image_splitting=False,
    size={"longest_edge": 490, "shortest_edge": 350},
)
```
|
|
|
#### Training Hyperparameters |
|
|
|
```python
import torch
from transformers import BitsAndBytesConfig, Idefics2ForConditionalGeneration

# 4-bit NF4 quantization of the base model for QLoRA training.
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.float16,
)
model = Idefics2ForConditionalGeneration.from_pretrained(
    "HuggingFaceM4/idefics2-8b",
    torch_dtype=torch.float16,
    quantization_config=quantization_config,
)
```
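The LoRA adapter configuration is not reproduced in this card; below is a sketch of a typical QLoRA setup for Idefics2, following the public Idefics2 fine-tuning examples (the rank, alpha, and target-module pattern are assumptions, not necessarily the exact values used here):

```python
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

lora_config = LoraConfig(
    r=8,
    lora_alpha=8,
    lora_dropout=0.1,
    # Target the attention/MLP projections of the text model and the vision-language connector.
    target_modules=".*(text_model|modality_projection|perceiver_resampler).*(down_proj|gate_proj|up_proj|k_proj|q_proj|v_proj|o_proj).*",
    init_lora_weights="gaussian",
)
model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```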
|
|
|
|
## Evaluation |
|
|
|
### Testing Data, Factors & Metrics |
|
|
|
#### Testing Data |
|
|
|
A held-out 20% of the dataset.
|
|
|
#### Metrics |
|
|
|
Edit distance, computed with NLTK (lower is better).
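A minimal sketch of the metric, assuming the scores in the table are edit distances normalized by the length of the longer string (the normalization is an assumption; the card only states that NLTK edit distance was used):

```python
from nltk.metrics.distance import edit_distance

def normalized_edit_distance(prediction: str, reference: str) -> float:
    """Levenshtein distance scaled to [0, 1] by the length of the longer string."""
    if not prediction and not reference:
        return 0.0
    return edit_distance(prediction, reference) / max(len(prediction), len(reference))

print(normalized_edit_distance("January 1, 2024", "January 2, 2024"))  # ~0.07: near-perfect match
```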
|
|
|
### Results |
|
|
|
See the per-category edit-distance table at the top of this card.