---
license: apache-2.0
datasets:
- Tongda/bid-announcement-zh-v1.0
base_model:
- Qwen/Qwen2-1.5B-Instruct
pipeline_tag: text-generation
tags:
- text-generation-inference
library_name: transformers
---

## **Model Overview**

This model is a fine-tuned version of Qwen2-1.5B-Instruct, adapted with Low-Rank Adaptation (LoRA). It is designed to extract key information from bidding and bid-winning announcements, identifying structured data such as project names, announcement types, budget amounts, and deadlines across the various formats of bidding notices.

The base model, Qwen2-1.5B-Instruct, is an instruction-tuned language model, and this fine-tuned version applies its instruction-following ability to data extraction from Chinese bid announcements.

---

## **Use Cases**

The model can be used in applications that need to extract structured data automatically from text documents, particularly documents related to government bidding and procurement.

For instance, for [the sample announcement](https://www.qhggzyjy.gov.cn/ggzy/jyxx/001002/001002001/20240827/1358880795267533.html), the generated output is as follows:

```
项目名称:"大通县公安局警用无人自动化机场项目"
公告类型:"采购公告-竞磋"
行业分类:"其他"
发布时间:"2024-08-27"
预算金额:"941500.00元"
采购人:"大通县公安局(本级)"
响应文件截至提交时间:"2024-09-10 09:00"
开标地址:"大通县政府采购服务中心"
所在地区:"青海省"
```

---

## **Key Features**

1. **Fine-tuned with LoRA**: The model has been adapted using LoRA, a parameter-efficient fine-tuning method that specializes it for this task while training only a small number of additional parameters and retaining the capabilities of the base model.
2. **Robust Information Extraction**: The model is trained to extract and check crucial fields, including budget values, submission deadlines, and industry classifications, aiming for accurate outputs even when announcement formats vary.
3. **Language & Domain Specificity**: The model is specialized for parsing official bidding announcements in Chinese and extracting the information required by downstream processes.

---

## **Model Architecture**

- **Base Model**: Qwen2-1.5B-Instruct
- **Fine-Tuning Technique**: LoRA
- **Training Data**: Fine-tuned on structured and unstructured government bidding announcements
- **Framework**: Hugging Face Transformers & PEFT (Parameter-Efficient Fine-Tuning)

## **Technical Specifications**

- **Device Compatibility**: CUDA (GPU-enabled)
- **Tokenization**: Uses `AutoTokenizer` from Hugging Face; prompts are formatted with the tokenizer's chat template.

## **Requirements**

```shell
pip install --upgrade 'transformers>=4.44.2' 'torch>=2.0' accelerate
```

## **Usage Example**

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# device_map="auto" lets accelerate place the weights on the available GPU.
model = AutoModelForCausalLM.from_pretrained(
    "Tongda/Tongda1-1.5B-BKI", device_map="auto", torch_dtype=torch.float16
)
tokenizer = AutoTokenizer.from_pretrained("Tongda/Tongda1-1.5B-BKI")
model.eval()

# System prompt (Chinese): extract nine fields as JSON; the budget must be in 元;
# 公告类型 and 行业分类 must each come from a fixed list of 12 categories.
instruction = "分析给定的公告,提取其中的“项目名称”、“公告类型”、“行业分类”、“发布时间”、“预算金额”、“采购人”、“响应文件截至提交时间”、”开标地址“、“所在地区”,并将其以json格式进行输出。如果公告出现“最高投标限价”相关的值,则“预算金额”为该值。请再三确认提取的值为项目的“预算金额”,而不是其他和“预算金额”无关的数值,否则“预算金额”中填入'None'。如果确认提取到了“预算金额”,请重点确认提取到的金额的单位,所有的“预算金额”单位为“元”。当涉及到进制转换的计算(比如“万元”转换为“元”单位)时,必须进行进制转换。其中“公告类型”只能从以下12类中挑选:采购公告-招标、采购公告-邀标、采购公告-询价、采购公告-竞谈、采购公告-竞磋、采购公告-竞价、采购公告-单一来源、采购公告-变更、采购结果-中标、采购结果-终止、采购结果-废标、采购结果-合同。其中,“行业分类”只能从以下12类中挑选:建筑与基础设施、信息技术与通信、能源与环保、交通与物流、金融与保险、医疗与健康、教育与文化、农业与林业、制造与工业、政府与公共事业、旅游与娱乐、其他。"

# The content of any bid announcement
input_report = "#### 通答产业园区(2024年-2027年)智能一体化项目公开招标公告..."

messages = [
    {"role": "system", "content": instruction},
    {"role": "user", "content": input_report}
]

text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

# Move the inputs to the same device as the model.
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# Disable gradient tracking for inference; pass the attention mask along with the input IDs.
with torch.no_grad():
    generated_ids = model.generate(
        **model_inputs,
        max_new_tokens=512
    )

# Drop the prompt tokens so that only the newly generated tokens are decoded.
generated_ids = [
    output_ids[len(input_ids):]
    for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
```

---

## **Evaluation & Performance**

On this information extraction task, Tongda1-1.5B-BKI scores higher than its base model, Qwen2-1.5B-Instruct, on every reported metric; ROUGE-1, for example, rises from 0.412 to 0.853. It also scores higher than the larger Qwen2.5-3B-Instruct and Qwen2-7B-Instruct models, and higher than the optimized online model glm-4-flash.

The evaluation results for each model are:

| Model                 | ROUGE-1 | ROUGE-2 | ROUGE-Lsum | BLEU  |
|-----------------------|---------|---------|------------|-------|
| Tongda1-1.5B-BKI      | 0.853   | 0.787   | 0.853      | 0.852 |
| Qwen2-1.5B-Instruct   | 0.412   | 0.231   | 0.411      | 0.431 |
| Qwen2.5-3B-Instruct   | 0.686   | 0.578   | 0.687      | 0.755 |
| Qwen2-7B-Instruct     | 0.703   | 0.578   | 0.703      | 0.789 |
| glm-4-flash           | 0.774   | 0.655   | 0.775      | 0.816 |

![image/png](https://cdn-uploads.huggingface.co/production/uploads/65ebcbd0c8577b39464e6dc0/Qiyi7onDe99b2USArl0oG.png)

---

## **Limitations**

- **Language Limitation**: The model is trained primarily on Chinese bidding announcements. Performance on other languages or on non-bidding content may be limited.
- **Strict Formatting**: Accuracy may drop when a bidding announcement deviates significantly from common structures.
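
Given these limitations, downstream code may want to check the extracted fields before using them. The following is a minimal sketch, not part of the released model or its tooling: it assumes the response has one `field:"value"` pair per line, as in the sample output under Use Cases, and the helper names `parse_fields` and `check_fields` are hypothetical. If the model returns a JSON object instead, as the instruction prompt requests, `json.loads` would replace the line parser.

```python
import re

# Allowed announcement types, copied from the instruction prompt in the usage example.
ANNOUNCEMENT_TYPES = {
    "采购公告-招标", "采购公告-邀标", "采购公告-询价", "采购公告-竞谈",
    "采购公告-竞磋", "采购公告-竞价", "采购公告-单一来源", "采购公告-变更",
    "采购结果-中标", "采购结果-终止", "采购结果-废标", "采购结果-合同",
}

# One field per line: a key, a full-width or ASCII colon, then a quoted value.
FIELD_PATTERN = re.compile(r'^\s*"?([^"::]+?)"?\s*[::]\s*"(.*?)"\s*,?\s*$')


def parse_fields(response: str) -> dict:
    """Parse lines such as 预算金额:"941500.00元" into a dict (hypothetical helper)."""
    fields = {}
    for line in response.splitlines():
        match = FIELD_PATTERN.match(line)
        if match:
            fields[match.group(1).strip()] = match.group(2)
    return fields


def check_fields(fields: dict) -> list:
    """Return a list of problems found; an empty list means both checks passed."""
    problems = []
    if fields.get("公告类型") not in ANNOUNCEMENT_TYPES:
        problems.append("公告类型 is not one of the 12 allowed values")
    budget = fields.get("预算金额", "None")
    # The prompt asks for 'None' or an amount already converted to 元.
    if budget != "None" and not re.fullmatch(r"\d+(\.\d+)?元", budget):
        problems.append("预算金额 is not a plain amount in 元")
    return problems


# Usage with three lines taken from the sample output under Use Cases.
sample = '公告类型:"采购公告-竞磋"\n预算金额:"941500.00元"\n响应文件截至提交时间:"2024-09-10 09:00"'
print(check_fields(parse_fields(sample)))  # → []
```

These checks only catch values that break the output constraints stated in the prompt, such as an unconverted `万元` amount; they cannot tell whether a well-formed value was extracted from the right part of the announcement.
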

---

## **Citation**

If you use this model, please consider citing it as follows:

```
@inproceedings{Tongda1-1.5B-BKI,
  title={Tongda1-1.5B-BKI: LoRA Fine-tuned Model for Bidding Announcements},
  author={Ted-Z},
  year={2024}
}
```

## **Contact**

For further inquiries or fine-tuning services, please contact us at [Tongda](https://www.tongdaai.com/).