	---
	library_name: transformers
	license: mit
	datasets:
	- chenghao/sec-material-contracts-qa-splitted
	- chenghao/sec-material-contracts-qa
	- jordyvl/DUDE_subset_100val
	language:
	- en
	pipeline_tag: document-question-answering
	---

	# Idefics2-EDGAR

	Idefics2 8B fine-tuned on 800+ multi-page documents for visual document question answering (DocQA). Install the latest `peft` and `transformers` before loading the model. A CUDA GPU is required: the usage example in this card loads the model in 4-bit with bitsandbytes.

	Compared to the base model, it achieves a lower edit distance on the test set (53% improvement on micro average). Per-category edit distances, where lower is better, are listed below.

	|    | Category                    | Idefics2-8B | Idefics2-8B-EDGAR | Δ (↑)  |
	|---:|:----------------------------|------------:|------------------:|:-------|
	|  0 | agreement_date              | 0.878489    | 0.0999479         | 88.62% |
	|  1 | agreement_term              | 0.907067    | 0.438816          | 51.62% |
	|  2 | auto_renewal                | 0.634946    | 0.0516129         | 91.87% |
	|  3 | contract_value              | 0.474438    | 0.418815          | 11.72% |
	|  4 | counterparty_address        | 0.771387    | 0.59835           | 22.43% |
	|  5 | counterparty_name           | 0.825491    | 0.633359          | 23.27% |
	|  6 | counterparty_signer_name    | 0.842091    | 0.480444          | 42.95% |
	|  7 | counterparty_signer_title   | 0.61746     | 0.496041          | 19.66% |
	|  8 | effective_date              | 0.903268    | 0.125641          | 86.09% |
	|  9 | expiration_date             | 0.88673     | 0.235197          | 73.48% |
	| 10 | governing_law               | 0.881037    | 0.308771          | 64.95% |
	| 11 | opt_out_length              | 0.431548    | 0.047619          | 88.97% |
	| 12 | party_address               | 0.730897    | 0.608301          | 16.77% |
	| 13 | party_name                  | 0.726411    | 0.490194          | 32.52% |
	| 14 | payment_frequency           | 0.686123    | 0.373724          | 45.53% |
	| 15 | payment_term                | 0.854552    | 0.593333          | 30.57% |
	| 16 | renewal_term                | 0.92829     | 0.0595238         | 93.59% |
	| 17 | termination_for_cause       | 0.436       | 0.048             | 88.99% |
	| 18 | termination_for_convenience | 0.628261    | 0.156522          | 75.09% |
	| 19 | termination_notice_period   | 0.329748    | 0.178394          | 45.90% |
	| 20 | venue                       | 0.781417    | 0.61403           | 21.42% |
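The Δ column appears to be the relative reduction in edit distance, (base − fine-tuned) / base. The card does not state the formula; this reading is an assumption that happens to reproduce the rows checked here. A minimal sketch, with the helper name `relative_improvement` chosen for illustration:

```python
def relative_improvement(base: float, tuned: float) -> float:
    """Relative reduction in edit distance, as a percentage (higher is better)."""
    if base <= 0:
        raise ValueError("base edit distance must be positive")
    return (base - tuned) / base * 100


# Two rows copied from the table above: (base model, fine-tuned model).
rows = {
    "agreement_date": (0.878489, 0.0999479),
    "termination_for_cause": (0.436, 0.048),
}
for category, (base, tuned) in rows.items():
    # Prints 88.62% and 88.99%, matching the table.
    print(f"{category}: {relative_improvement(base, tuned):.2f}%")
```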



	![image/png](https://cdn-uploads.huggingface.co/production/uploads/607a5b44489fc71534e91c0e/3Jc7I1Fj2J3rabos2HLyY.png)

	## Model Details

	### Model Description

	Fine-tuned from [Idefics2](https://huggingface.co/docs/transformers/main/en/model_doc/idefics2).

	## Uses

	```python
	import torch
	from datasets import load_from_disk
	from transformers import AutoProcessor, BitsAndBytesConfig, Idefics2ForConditionalGeneration

	base_model = "HuggingFaceM4/idefics2-8b"
	peft_model_id = "chenghao/idefics2-edgar"

	# 4-bit NF4 quantization, the same settings used for training.
	quantization_config = BitsAndBytesConfig(
	    load_in_4bit=True,
	    bnb_4bit_quant_type="nf4",
	    bnb_4bit_use_double_quant=True,
	    bnb_4bit_compute_dtype=torch.float16,
	)
	# peft must be installed so the adapter weights in this repo can be loaded.
	model = Idefics2ForConditionalGeneration.from_pretrained(
	    peft_model_id,
	    torch_dtype=torch.float16,
	    quantization_config=quantization_config,
	)
	model.eval()

	processor = AutoProcessor.from_pretrained(
	    base_model,
	    do_image_splitting=True,
	    size={"longest_edge": 490, "shortest_edge": 350},
	)

	# "local-dataset" is a placeholder path to a dataset saved with save_to_disk.
	dataset = load_from_disk("local-dataset")
	test_example = dataset["test"][30]
	images, question, answer = test_example["images"], test_example["question"], test_example["answer"]

	# One image placeholder per page, followed by the question text.
	messages = [
	    {
	        "role": "user",
	        "content": [{"type": "image"} for _ in range(len(images))] + [{"type": "text", "text": question}],
	    },
	]
	prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
	inputs = processor(text=prompt, images=images, return_tensors="pt").to("cuda")

	with torch.no_grad():
	    generated_ids = model.generate(**inputs, max_new_tokens=1024)
	generated_texts = processor.batch_decode(generated_ids, skip_special_tokens=True)

	# Keep only the text after the first "Assistant:" marker.
	preds = [t.split("Assistant:", 1)[-1].strip() for t in generated_texts]
	print(f"""
	Question: {question}
	Answer: {answer}
	Prediction: {preds or 'N/A'}
	""")
	```
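The prompt construction and answer extraction in the example can be factored into two small helpers. This is a sketch only: the names `build_messages` and `extract_answer` are hypothetical, and capping the page count at inference time is an assumption borrowed from `MAX_PAGE = 5` in the training configuration, not something the card states for inference.

```python
MAX_PAGE = 5  # value taken from the training configuration in this card


def build_messages(question: str, num_pages: int, max_page: int = MAX_PAGE) -> list:
    """Build a single-turn chat message with one image slot per page."""
    pages = min(num_pages, max_page)
    content = [{"type": "image"} for _ in range(pages)]
    content.append({"type": "text", "text": question})
    return [{"role": "user", "content": content}]


def extract_answer(decoded: str) -> str:
    """Return the text after the first "Assistant:" marker, stripped."""
    return decoded.split("Assistant:", 1)[-1].strip()


messages = build_messages("What is the governing law?", num_pages=8)
print(len(messages[0]["content"]))  # 5 image slots + 1 text entry = 6
print(extract_answer("User: What is the governing law? \nAssistant: New York."))
```

If the page count is capped this way, the image list passed to the processor has to be truncated to the same length (for example `images[:MAX_PAGE]`), so that the number of images matches the number of image placeholders.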

	## Training Details

	### Training Data

	[SEC Contract QA](https://huggingface.co/datasets/chenghao/sec-material-contracts-qa)

	### Training Procedure

	Trained for 10 epochs with QLoRA on an A100-80GB GPU, which took about 10 hours.

	```python
	MAX_LENGTH = 1024
	USE_LORA = False
	USE_QLORA = True
	MAX_PAGE = 5

	config = {
	    "max_epochs": 10,
	    # "val_check_interval": 0.2,
	    "check_val_every_n_epoch": 1,
	    "gradient_clip_val": 1.0,
	    "accumulate_grad_batches": 12,
	    "lr": 1e-4,
	    "batch_size": 2,
	    "precision": "16-mixed",
	    "seed": 42,
	    "warmup_steps": 50,
	    "result_path": "./result",
	    "verbose": True,
	}
	```
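With gradient accumulation, the number of examples contributing to each optimizer step is the per-step batch size multiplied by the number of accumulated batches. A quick check using the values from the configuration above; the device count is an assumption, since the card does not state how many GPUs were used:

```python
# Values copied from the training configuration.
config = {"batch_size": 2, "accumulate_grad_batches": 12}

# Assumption: a single device; adjust if training was distributed.
num_devices = 1

effective_batch_size = (
    config["batch_size"] * config["accumulate_grad_batches"] * num_devices
)
print(effective_batch_size)  # 24
```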

	#### Preprocessing

	Image splitting was disabled during training because of GPU memory limits.

	```python
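	from transformers import AutoProcessor

	# Image splitting is turned off to reduce memory use during training.
	processor = AutoProcessor.from_pretrained(
	    "HuggingFaceM4/idefics2-8b",
	    do_image_splitting=False,
	    size={"longest_edge": 490, "shortest_edge": 350},
	)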
	```

	#### Training Hyperparameters

	```python
	import torch
	from transformers import BitsAndBytesConfig, Idefics2ForConditionalGeneration

	# 4-bit NF4 quantization with double quantization (QLoRA setup).
	quantization_config = BitsAndBytesConfig(
	    load_in_4bit=True,
	    bnb_4bit_quant_type="nf4",
	    bnb_4bit_use_double_quant=True,
	    bnb_4bit_compute_dtype=torch.float16,
	)
	model = Idefics2ForConditionalGeneration.from_pretrained(
	    "HuggingFaceM4/idefics2-8b",
	    torch_dtype=torch.float16,
	    quantization_config=quantization_config,
	)
	```

	## Evaluation

	### Testing Data, Factors & Metrics

	#### Testing Data

	20% of the dataset.

	#### Metrics

	Edit distance, computed with `nltk`. Lower is better.
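The reported scores fall between 0 and 1, which suggests the edit distance is normalized, but the card does not give the exact formula. The sketch below is a dependency-free illustration that assumes normalization by the length of the longer string; `levenshtein` and `normalized_edit_distance` are illustrative names. With default arguments, `nltk.edit_distance(a, b)` should return the same raw count as `levenshtein(a, b)`.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance with unit costs."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def normalized_edit_distance(prediction: str, reference: str) -> float:
    """Edit distance divided by the longer length (assumed normalization)."""
    longest = max(len(prediction), len(reference))
    if longest == 0:
        return 0.0
    return levenshtein(prediction, reference) / longest


print(levenshtein("kitten", "sitting"))  # 3
print(normalized_edit_distance("New York", "New York"))  # 0.0
```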

	### Results

	See the per-category table at the top of this card.