---
license: apache-2.0
datasets:
- Tongda/bid-announcement-zh-v1.0
base_model:
- Qwen/Qwen2-1.5B-Instruct
pipeline_tag: text-generation
tags:
- text-generation-inference
library_name: transformers
---
## **Model Overview**
This model is a fine-tuned version of Qwen2-1.5B-Instruct using Low-Rank Adaptation (LoRA). It is designed to extract key information from bidding and bid-winning announcements, identifying structured fields such as project names, announcement types, budget amounts, and deadlines across the varied formats of bidding notices.
The base model, Qwen2-1.5B-Instruct, is a large-scale language model optimized for instruction-following tasks; this fine-tuned version applies those capabilities to precise data extraction from Chinese bid announcements.
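The exact training configuration is not published; for readers unfamiliar with LoRA, here is a minimal sketch of how such an adapter is typically attached with PEFT (all hyperparameters below are assumptions, not the actual training settings):

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Hypothetical LoRA settings; the actual fine-tuning hyperparameters are not published.
base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-1.5B-Instruct")
lora_config = LoraConfig(
    r=8,                                   # low-rank dimension (assumed)
    lora_alpha=16,                         # scaling factor (assumed)
    target_modules=["q_proj", "v_proj"],   # attention projections (assumed)
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable
```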
---
## **Use Cases**
The model can be used in applications that require automatic extraction of structured data from text documents, particularly those related to government bidding and procurement. For instance, given [this sample announcement](https://www.qhggzyjy.gov.cn/ggzy/jyxx/001002/001002001/20240827/1358880795267533.html), the model generates the following output (the field names are the Chinese labels for project name, announcement type, industry category, publication date, budget amount, purchaser, response submission deadline, bid-opening address, and region):
```
项目名称:"大通县公安局警用无人自动化机场项目"
公告类型:"采购公告-竞磋"
行业分类:"其他"
发布时间:"2024-08-27"
预算金额:"941500.00元"
采购人:"大通县公安局(本级)"
响应文件截至提交时间:"2024-09-10 09:00"
开标地址:"大通县政府采购服务中心"
所在地区:"青海省"
```
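Downstream systems usually consume this output as structured data. A minimal parsing sketch (`parse_extraction` is a hypothetical helper, assuming one colon-separated `字段:"值"` pair per line as in the sample above):

```python
# Hypothetical helper: parse one `字段:"值"` pair per line into a dict.
sample = '项目名称:"大通县公安局警用无人自动化机场项目"\n预算金额:"941500.00元"'

def parse_extraction(response: str) -> dict:
    fields = {}
    for line in response.strip().splitlines():
        key, sep, value = line.partition(":")  # full-width colon, as in the sample
        if not sep:
            key, sep, value = line.partition(":")  # fall back to an ASCII colon
        if sep:
            fields[key.strip()] = value.strip().strip('"')
    return fields

print(parse_extraction(sample)["预算金额"])  # 941500.00元
```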
---
## **Key Features**
1. **Fine-tuned with LoRA**: The model has been adapted with LoRA, a parameter-efficient fine-tuning method that trains small low-rank adapter matrices while leaving the base model's weights frozen, preserving the base model's general capabilities.
2. **Robust Information Extraction**: The model is trained to extract and validate crucial fields, including budget values, submission deadlines, and industry classifications, keeping outputs accurate across variable announcement formats (see the validation sketch after this list).
3. **Language & Domain Specificity**: The model excels in parsing official bidding announcements in Chinese and accurately extracting the required information for downstream processes.
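For budget values in particular, the system prompt requires amounts normalized to yuan (元), including conversion from 万元. A hedged sketch of how a caller might double-check an extracted amount (`normalize_budget` is a hypothetical helper, not part of the model):

```python
import re
from decimal import Decimal

def normalize_budget(amount: str):
    # Hypothetical helper: pull the first number out of an extracted amount
    # string and normalize it to yuan (元).
    match = re.search(r"([\d,]+(?:\.\d+)?)", amount)
    if match is None:
        return None  # no numeric value found
    value = Decimal(match.group(1).replace(",", ""))
    # Convert 万元 (ten-thousands of yuan) to 元 when the unit is present.
    return value * 10_000 if "万" in amount else value

print(normalize_budget("94.15万元"))    # 941500.00
print(normalize_budget("941500.00元"))  # 941500.00
```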
---
## **Model Architecture**
- **Base Model**: Qwen2-1.5B-Instruct
- **Fine-Tuning Technique**: LoRA
- **Training Data**: Fine-tuned on structured and unstructured government bidding announcements
- **Framework**: Hugging Face Transformers & PEFT (Parameter-Efficient Fine-Tuning)
---
## **Technical Specifications**
- **Device Compatibility**: CUDA-enabled GPUs (recommended); CPU inference is possible but substantially slower.
- **Tokenization**: Uses the Hugging Face `AutoTokenizer` with the Qwen2 chat template for instruction-style prompts.
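Since the usage example below assumes a GPU, here is a minimal device-selection sketch that falls back to CPU when CUDA is unavailable:

```python
import torch

# Prefer a CUDA GPU; fall back to CPU (slower, but workable for testing).
device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32
```
---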
## **Requirements**
```shell
pip install --upgrade 'transformers>=4.44.2' 'torch>=2.0' accelerate
```
---
## **Usage Example**
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
# Load the fine-tuned model in half precision and place it on the available device(s).
model = AutoModelForCausalLM.from_pretrained(
    "Tongda/Tongda1-1.5B-BKI",
    device_map="auto",
    torch_dtype=torch.float16,
)
tokenizer = AutoTokenizer.from_pretrained("Tongda/Tongda1-1.5B-BKI")
model.eval()

# System prompt (in Chinese): it lists the nine fields to extract, restricts
# 公告类型 and 行业分类 to fixed label sets, requires JSON output, and asks for
# 预算金额 normalized to yuan (converting units such as 万元 to 元).
instruction = "分析给定的公告,提取其中的“项目名称”、“公告类型”、“行业分类”、“发布时间”、“预算金额”、“采购人”、“响应文件截至提交时间”、”开标地址“、“所在地区”,并将其以json格式进行输出。如果公告出现“最高投标限价”相关的值,则“预算金额”为该值。请再三确认提取的值为项目的“预算金额”,而不是其他和“预算金额”无关的数值,否则“预算金额”中填入'None'。如果确认提取到了“预算金额”,请重点确认提取到的金额的单位,所有的“预算金额”单位为“元”。当涉及到进制转换的计算(比如“万元”转换为“元”单位)时,必须进行进制转换。其中“公告类型”只能从以下12类中挑选:采购公告-招标、采购公告-邀标、采购公告-询价、采购公告-竞谈、采购公告-竞磋、采购公告-竞价、采购公告-单一来源、采购公告-变更、采购结果-中标、采购结果-终止、采购结果-废标、采购结果-合同。其中,“行业分类”只能从以下12类中挑选:建筑与基础设施、信息技术与通信、能源与环保、交通与物流、金融与保险、医疗与健康、教育与文化、农业与林业、制造与工业、政府与公共事业、旅游与娱乐、其他。"
# The content of any bid announcement (the sample below is truncated)
input_report = "#### 通答产业园区(2024年-2027年)智能一体化项目公开招标公告..."
messages = [
{"role": "system", "content": instruction},
{"role": "user", "content": input_report}
]
text = tokenizer.apply_chat_template(
messages,
tokenize=False,
add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# Generate, then strip the prompt tokens so only the new tokens are decoded.
generated_ids = model.generate(
    model_inputs.input_ids,
    max_new_tokens=512
)
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]
response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
```
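Since the system prompt requests JSON output, the response can usually be parsed directly. Continuing from the snippet above, a hedged sketch (`parse_response` is a hypothetical helper; the fence-stripping guards against the model wrapping its JSON in Markdown code fences, which we have not verified for this model):

```python
import json

def parse_response(response: str):
    # Strip optional Markdown code fences before parsing (hypothetical helper).
    cleaned = response.strip()
    cleaned = cleaned.removeprefix("```json").removeprefix("```").removesuffix("```").strip()
    try:
        return json.loads(cleaned)
    except json.JSONDecodeError:
        return None  # fall back to inspecting the raw response

result = parse_response(response)
if result is not None:
    print(result.get("预算金额"))
```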
---
## **Evaluation & Performance**
Tongda1-1.5B-BKI performs strongly on key-information extraction from tender announcements. Compared to the baseline Qwen2-1.5B-Instruct, it improves substantially on every evaluation metric. It also outperforms the larger Qwen2.5-3B-Instruct and Qwen2-7B-Instruct models, as well as the hosted glm-4-flash model. The evaluation results for each model:
| Model | ROUGE-1 | ROUGE-2 | ROUGE-Lsum | BLEU |
|-----------------------|---------|---------|------------|-------|
| Tongda1-1.5B-BKI | 0.853 | 0.787 | 0.853 | 0.852 |
| Qwen2-1.5B-Instruct | 0.412 | 0.231 | 0.411 | 0.431 |
| Qwen2.5-3B-Instruct | 0.686 | 0.578 | 0.687 | 0.755 |
| Qwen2-7B-Instruct | 0.703 | 0.578 | 0.703 | 0.789 |
| glm-4-flash | 0.774 | 0.655 | 0.775 | 0.816 |
![image/png](https://cdn-uploads.huggingface.co/production/uploads/65ebcbd0c8577b39464e6dc0/Qiyi7onDe99b2USArl0oG.png)
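The exact evaluation script is not published here; as a reference point, ROUGE and BLEU are commonly computed with the Hugging Face `evaluate` library. In this sketch, `jieba` word segmentation is our assumption for tokenizing Chinese text, and the prediction/reference strings are toy placeholders:

```python
import evaluate  # pip install evaluate rouge_score jieba
import jieba     # assumed Chinese word segmenter; any segmenter works

rouge = evaluate.load("rouge")
bleu = evaluate.load("bleu")

def tokenize(text: str) -> str:
    # ROUGE/BLEU expect space-separated tokens, so segment the Chinese text first.
    return " ".join(jieba.cut(text))

predictions = [tokenize('预算金额:"941500.00元"')]  # model outputs (toy placeholder)
references = [tokenize('预算金额:"941500.00元"')]   # gold annotations (toy placeholder)

print(rouge.compute(predictions=predictions, references=references, tokenizer=str.split))
print(bleu.compute(predictions=predictions, references=[[r] for r in references]))
```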
---
## **Limitations**
- **Language Limitation**: The model is primarily trained on Chinese bidding announcements. Performance on other languages or non-bidding content may be limited.
- **Format Sensitivity**: Accuracy may drop when announcements deviate significantly from common structures.
---
## **Citation**
If you use this model, please consider citing it as follows:
```
@misc{Tongda1-1.5B-BKI,
  title={Tongda1-1.5B-BKI: LoRA Fine-tuned Model for Bidding Announcements},
  author={Ted-Z},
  year={2024},
  howpublished={\url{https://huggingface.co/Tongda/Tongda1-1.5B-BKI}}
}
```
---
## **Contact**
For further inquiries or fine-tuning services, please contact us at [Tongda](https://www.tongdaai.com/). |