liangxz committed on
Commit
44f6cc5
1 Parent(s): d843044

Update README.md

Files changed (1)
  1. README.md +67 -41
README.md CHANGED
@@ -9,40 +9,43 @@ tags:
9
  ---
10
 
11
  <p align="center" width="100%">
12
- <a href="" target="_blank"><img src="https://github.com/zjunlp/CaMA/blob/main/assets/logo.jpg?raw=true" alt="ZJU-CaMA" style="width: 30%; min-width: 30px; display: block; margin: auto;"></a>
13
  </p>
14
 
15
 
16
- > This is the result of the `CaMA-13B` LoRA weights. You can click [here](https://github.com/zjunlp/cama) to learn more.
17
 
18
 
19
- # CaMA: A Chinese-English Bilingual LLaMA Model
20
 
21
- With the birth of ChatGPT, artificial intelligence has also entered the "iPhone moment," where various large language models (LLMs) have sprung up like mushrooms. The wave of these large models has quickly swept through artificial intelligence fields beyond natural language processing. However, training such a model requires extremely high hardware costs, and open-source language models are scarce due to various reasons, making Chinese language models even more scarce. It wasn't until the open-sourcing of LLaMA that a variety of language models based on LLaMA started to emerge. This project is also based on the LLaMA model. To further enhance Chinese language capabilities without compromising its original language distribution, we first <b>(1) perform additional pre-training on LLaMA (13B) using Chinese corpora, aiming to improve the model's Chinese comprehension and knowledge base while preserving its original English and code abilities to the greatest extent possible;</b> then, <b>(2) we fine-tune the model from the first step using an instruction dataset to enhance the language model's understanding of human instructions.</b>
 
 
 
22
 
23
  **The features of this project are as follows:**
24
 
25
- - We conducted full pre-training on LLaMA using the Chinese pre-training corpus we built, which improved the model's understanding of Chinese.
26
- - We utilized our Chinese instruction dataset, consisting of approximately 1.4 million samples, and performed LoRA fine-tuning to enhance the model's comprehension of human instructions.
27
- - We optimized the Information Extraction (IE) tasks, including Named Entity Recognition (NER), Relation Extraction (RE), and Event Extraction (EE), by utilizing human instructions to accomplish information extraction tasks.
28
- - We have open-sourced the weights of the pre-trained model and the LoRA weights used for instruction fine-tuning.
29
- - We have also made the full pre-training script available, which includes transformations, construction, and loading of large-scale corpora, as well as the LoRA instruction fine-tuning script.
30
 
31
 
32
- All weights have been uploaded to Hugging Face. The CaMA differential weights can be found [here](https://huggingface.co/zjunlp/CaMA-13B-Diff), and the LoRA weights can be found [here](https://huggingface.co/zjunlp/CaMA-13B-LoRA).
33
 
34
  ## Contents
35
 
36
- - Cases
37
  - [Pretraining Cases](#1-1)
38
  - [Information Extraction Cases](#1-2)
39
  - [General Ability Cases](#1-3)
40
- - Quick Start
41
  - [Environment Configuration](#2-1)
42
  - [Model Weights (Pretrain and LoRA)](#2-2)
43
  - [Model Usage Guide](#2-4)
44
  - [Information Extraction Prompt](#2-5)
45
- - Training Details
46
  - [Pretraining data and Pretraining scripts](#3-1)
47
  - [Instruction data and Instruction-tuning scripts](#3-3)
48
  - [Limitations](#4)
@@ -205,10 +208,14 @@ Our pre-trained model has demonstrated certain abilities in instruction followin
205
  The effectiveness of information extraction is illustrated in the following figure. We tested different instructions for different tasks as well as the same instructions for the same task, and achieved good results for all of them.
206
 
207
  <p align="center" width="100%">
208
- <a href="" target="_blank"><img src="https://github.com/zjunlp/CaMA/blob/main/assets/ie-case.jpg?raw=true" alt="IE" style="width: 60%; min-width: 60px; display: block; margin: auto;"></a>
209
  </p>
210
 
 
211
 
 
 
 
212
 
213
  <h3 id="1-3">1.3 General Ablities Cases</h3>
214
 
@@ -362,8 +369,8 @@ The effectiveness of information extraction is illustrated in the following figu
362
  <h3 id="2-1">2.1 Environment Configuration</h3>
363
 
364
  ```shell
365
- conda create -n cama python=3.9 -y
366
- conda activate cama
367
  pip install torch==1.12.0+cu116 torchvision==0.13.0+cu116 torchaudio==0.12.0 --extra-index-url https://download.pytorch.org/whl/cu116
368
  pip install -r requirements.txt
369
  ```
@@ -371,9 +378,9 @@ pip install -r requirements.txt
371
 
372
  <h3 id="2-2">2.2 Pretraining model weight acquisition and restoration</h3>
373
 
374
- > Since the Meta has not fully released the weights of LLaMA, we have computed the difference between the CaMA weights and the LLaMA weights and uploaded them [here](https://huggingface.co/zjunlp/CaMA-13B-Diff). To restore the complete CaMA weights, please follow the steps outlined below.
375
 
376
- **1. Download LLaMA 13B and CaMA-13B-Diff**
377
 
378
  Please click [here](https://forms.gle/jk851eBVbX1m5TAv5) to apply for the official pre-training weights of LLaMA from `meta`. In this case, we are using the `13B` version of the model, so you only need to download the `13B` version. Once downloaded, the file directory will be as follows:
379
 
@@ -388,10 +395,16 @@ Please click [here](https://forms.gle/jk851eBVbX1m5TAv5) to apply for the offici
388
  |-- tokenizer_checklist.chk
389
  ```
390
 
391
- You can use the following command to download the `CaMA-diff` file (assuming it is saved in the `./CaMA-Diff` folder):
 
 
 
 
 
392
  ```shell
393
- python tools/download.py --download_path ./CaMA-Diff --only_base
394
  ```
 
395
  > :exclamation: Note: if the download is interrupted, simply repeat the command above. Hugging Face supports resumable downloads, so the download will continue from where it left off.
396
 
397
  **2. Use the conversion script provided by Hugging Face**
@@ -402,17 +415,23 @@ To convert the original LLaMA-13B model into the HuggingFace format, you can use
402
  python convert_llama_weights_to_hf.py --input_dir ./ --model_size 13B --output_dir ./converted
403
  ```
404
 
405
- **3. Restore CaMA 13B**
 
 
406
 
407
- Use the script we provided, located at `./tools/weight_diff.py`, execute the following command, and you will get the complete `CaMA` weight:
 
 
408
 
 
 
 
409
  ```shell
410
- python tools/weight_diff.py recover --path_raw ./converted --path_diff ./CaMA-Diff --path_tuned ./CaMA
411
  ```
412
 
413
- The final complete CaMA weights are saved in the `./CaMA` folder.
414
 
415
-
416
 
417
  <h3 id="2-3">2.3 Instruction tuning LoRA weight acquisition</h3>
418
 
@@ -430,26 +449,28 @@ The final complete weights are saved in the `./LoRA` folder.
430
 
431
  **1. Reproduce the results in Section 1**
432
 
433
- 1. If you want to reproduce the results in section `1.1`(**pretraining cases**), please run the following command (assuming that the complete pre-training weights of `CaMA` have been obtained according to the steps in section `2.2`, and the CaMA weight is saved in the `./CaMA` folder):
 
 
434
 
435
  ```shell
436
- python examples/generate_finetune.py --base_model ./CaMA
437
  ```
438
 
439
  The result in section `1.1` can be obtained.
440
 
441
- 2. If you want to reproduce the results in section `1.2`(**information extraction cases**), please run the following command (assuming that the LoRA weights of `CaMA` have been obtained according to the steps in section `2.3`, and the LoRA weights is saved in the `./LoRA` folder):
442
 
443
  ```shell
444
- python examples/generate_lora.py --load_8bit --base_model ./CaMA --lora_weights ./LoRA --run_ie_cases
445
  ```
446
 
447
  The result in section `1.2` can be obtained.
448
 
449
- 3. If you want to reproduce the results in section `1.3`(**general ablities cases**), please run the following command (assuming that the LoRA weights of `CaMA` have been obtained according to the steps in section `2.3`, and the LoRA weights is saved in the `./LoRA` folder):
450
 
451
  ```shell
452
- python examples/generate_lora.py --load_8bit --base_model ./CaMA --lora_weights ./LoRA --run_general_cases
453
  ```
454
 
455
  The result in section `1.3` can be obtained.
@@ -463,7 +484,7 @@ We offer two methods: the first one is **command-line interaction**, and the sec
463
  1. Use the following command to enter **command-line interaction**:
464
 
465
  ```shell
466
- python examples/generate_finetune.py --base_model ./CaMA --interactive
467
  ```
468
 
469
  The disadvantage is the inability to dynamically change decoding parameters.
@@ -471,24 +492,25 @@ We offer two methods: the first one is **command-line interaction**, and the sec
471
  2. Use the following command to enter **web-based interaction**:
472
 
473
  ```shell
474
- python examples/generate_finetune_web.py --base_model ./CaMA
475
  ```
476
  Here is a screenshot of the web-based interaction:
477
  <p align="center" width="100%">
478
- <a href="" target="_blank"><img src="https://github.com/zjunlp/CaMA/blob/main/assets/finetune_web.jpg?raw=true" alt="finetune-web" style="width: 100%; min-width: 100px; display: block; margin: auto;"></a>
479
  </p>
480
 
 
481
  **3. Usage of the Instruction-tuned Model**
482
 
483
  Here, we provide a web-based interaction method. Use the following command to access the web:
484
 
485
  ```shell
486
- python examples/generate_lora_web.py --base_model ./CaMA --lora_weights ./LoRA
487
  ```
488
 
489
  Here is a screenshot of the web-based interaction:
490
  <p align="center" width="100%">
491
- <a href="" target="_blank"><img src="https://github.com/zjunlp/CaMA/blob/main/assets/lora_web.png?raw=true" alt="finetune-web" style="width: 100%; min-width: 100px; display: block; margin: auto;"></a>
492
  </p>
493
 
494
  The `instruction` is a required parameter, while `input` is an optional parameter. For general tasks (such as the examples provided in section `1.3`), you can directly enter the input in the `instruction` field. For information extraction tasks (as shown in the example in section `1.2`), please enter the instruction in the `instruction` field and the sentence to be extracted in the `input` field. We provide an information extraction prompt in section `2.5`.
@@ -499,8 +521,9 @@ If you want to perform batch testing, please modify the `examples/generate_lora.
499
 
500
  <h3 id="2-5">2.5 Information Extraction Prompt</h3>
501
 
502
- For information extraction tasks such as named entity recognition (NER), event extraction (EE), and relation extraction (RE), we provide some prompts for ease of use. You can refer to this [link](./examples/ie_prompt.py) for examples. Of course, you can also try using your own prompts.
503
 
 
504
 
505
 
506
  <h2 id="3">3. Training Details</h2>
@@ -511,7 +534,7 @@ For information extraction tasks such as named entity recognition (NER), event e
511
  >
512
  > (2) Instruction tuning stage using LoRA. This stage enables the model to understand human instructions and generate appropriate responses.
513
 
514
- ![](https://github.com/zjunlp/CaMA/blob/main/assets/main.jpg?raw=true)
515
 
516
  <h3 id="3-1">3.1 Dataset Construction (Pretraining)</h3>
517
 
@@ -521,7 +544,7 @@ For the crawled datasets mentioned above, we employed a heuristic approach to fi
521
 
522
  <h3 id="3-2">3.2 Training Process (Pretraining)</h3>
523
 
524
- Detailed data processing code, training code, complete training scripts, and detailed training results can be found in [./pretrain](./pretrain).
525
 
526
  Before training, we need to tokenize the data. We set the maximum length of a single sample to `1024`, while most documents are much longer than this. Therefore, we need to partition these documents. **We designed a greedy algorithm to split the documents, with the goal of ensuring that each sample consists of complete sentences and minimizing the number of segments while maximizing the length of each sample.** Additionally, due to the diversity of data sources, we developed a comprehensive data preprocessing tool that can process and merge data from various sources. Finally, considering the large amount of data, loading it directly into memory would impose excessive hardware pressure. Therefore, we referred to [DeepSpeed-Megatron](https://github.com/bigscience-workshop/Megatron-DeepSpeed/tree/main/tools) and used the `mmap` method to process and load the data. This involves loading the indices into memory and accessing the corresponding data on disk when needed.
527
 
@@ -553,7 +576,10 @@ In addition, we manually constructed a general Chinese dataset and translated it
553
  | Information Extraction Datasets (English) | 537429 |
554
  | Information Extraction Datasets (Chinese) | 486768 |
555
 
556
-
 
 
 
557
 
558
  <h3 id="3-4">3.4 Training Process (Instruction tuning)</h3>
559
 
@@ -643,4 +669,4 @@ We are very grateful to the following open source projects for their help:
643
 
644
  - [Vicuna](https://vicuna.lmsys.org/)
645
 
646
- - [Llama-X](https://github.com/AetherCortex/Llama-X)
 
9
  ---
10
 
11
  <p align="center" width="100%">
12
+ <a href="" target="_blank"><img src="https://github.com/zjunlp/KnowLM/blob/main/assets/logo_zhixi.png?raw=true" alt="ZJU-KnowLM" style="width: 40%; min-width: 40px; display: block; margin: auto;"></a>
13
  </p>
14
 
15
 
16
+ > These are the `ZhiXi-13B` LoRA weights. You can click [here](https://github.com/zjunlp/KnowLM) to learn more.
17
 
18
 
19
+ # Knowledgeable Large Language Model Framework
20
 
21
+ With the rapid development of deep learning, large language models such as ChatGPT have made substantial strides in natural language processing. However, these models still face several challenges in acquiring and understanding knowledge, including the difficulty of updating knowledge and potential knowledge discrepancies and biases, collectively known as knowledge fallacies. The KnowLM project aims to tackle these issues by launching an open-source, large-scale knowledgeable language model framework and releasing the corresponding models.
22
+
23
+ The project's `initial phase` introduced a knowledge extraction LLM based on LLaMA, dubbed **ZhiXi** (**智析**, meaning intelligent analysis of data for information extraction). To integrate the capacity for Chinese understanding into the model without compromising its inherent knowledge, we first <b>(1) perform full-scale pre-training of LLaMA (13B) on Chinese corpora, augmenting the model's understanding of Chinese and improving its knowledge richness while retaining its original English and code capabilities;</b> then <b>(2) we fine-tune the model obtained in the first step with an instruction dataset, bolstering its understanding of human instructions for knowledge extraction.</b>
24
+ - ❗Please note that this project is still undergoing optimization, and the model weights will be regularly updated to support new features and models!
25
 
26
  **The features of this project are as follows:**
27
 
28
+ - Centered on knowledge and large models, we perform **full-scale pre-training** of a large model such as LLaMA on the Chinese and English pre-training corpus we built.
29
+ - Based on **KG2Instructions** technology, the knowledge extraction tasks, including NER, RE, and IE, are optimized so that they can be completed using human instructions.
30
+ - Using the Chinese instruction dataset we built (approximately 1,400K samples), we apply LoRA fine-tuning to enhance the model's understanding of human instructions.
31
+ - The weights of the pre-trained model and the instruction-tuned LoRA weights are open-sourced.
32
+ - The **full-scale pre-training code** (covering conversion, construction, and loading of large corpora) and the **LoRA instruction fine-tuning code** are open-sourced (with support for multi-machine, multi-GPU training).
33
 
34
 
35
+ All weights have been uploaded to Hugging Face. The ZhiXi differential weights can be found [here](https://huggingface.co/zjunlp/zhixi-13B-Diff), and the LoRA weights can be found [here](https://huggingface.co/zjunlp/zhixi-13B-LoRA).
36
 
37
  ## Contents
38
 
39
+ - [Cases](#1)
40
  - [Pretraining Cases](#1-1)
41
  - [Information Extraction Cases](#1-2)
42
  - [General Ability Cases](#1-3)
43
+ - [Quick Start](#2)
44
  - [Environment Configuration](#2-1)
45
  - [Model Weights (Pretrain and LoRA)](#2-2)
46
  - [Model Usage Guide](#2-4)
47
  - [Information Extraction Prompt](#2-5)
48
+ - [Training Details](#3)
49
  - [Pretraining data and Pretraining scripts](#3-1)
50
  - [Instruction data and Instruction-tuning scripts](#3-3)
51
  - [Limitations](#4)
 
208
  The effectiveness of information extraction is illustrated in the following figure. We tested different instructions for different tasks as well as the same instructions for the same task, and achieved good results for all of them.
209
 
210
  <p align="center" width="100%">
211
+ <a href="" target="_blank"><img src="https://github.com/zjunlp/KnowLM/blob/main/assets/ie-case-new_logo-en.png?raw=true" alt="IE" style="width: 60%; min-width: 60px; display: block; margin: auto;"></a>
212
  </p>
213
 
214
+ As shown in the figure, compared with other large models such as ChatGPT, our model achieves more accurate and comprehensive extraction results. However, we have also identified some extraction errors in ZhiXi. In the future, we will continue to enhance the model's semantic understanding in both Chinese and English and introduce more high-quality instruction data to improve performance.
215
 
216
+ <p align="center" width="100%">
217
+ <a href="" target="_blank"><img src="https://github.com/zjunlp/KnowLM/blob/main/assets/casevschatgpt.png?raw=true" alt="IE-cases-vs-chatgpt" style="width: 60%; min-width: 60px; display: block; margin: auto;"></a>
218
+ </p>
219
 
220
  <h3 id="1-3">1.3 General Ablities Cases</h3>
221
 
 
369
  <h3 id="2-1">2.1 Environment Configuration</h3>
370
 
371
  ```shell
372
+ conda create -n zhixi python=3.9 -y
373
+ conda activate zhixi
374
  pip install torch==1.12.0+cu116 torchvision==0.13.0+cu116 torchaudio==0.12.0 --extra-index-url https://download.pytorch.org/whl/cu116
375
  pip install -r requirements.txt
376
  ```
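To confirm that the pinned PyTorch build is paired with the CUDA 11.6 runtime, a quick optional check such as the following may help (a small sketch, not part of the repository's own scripts):

```python
# Optional sanity check for the environment created above.
import torch

print(torch.__version__)          # expected to report 1.12.0+cu116
print(torch.cuda.is_available())  # should print True on a machine with a working CUDA GPU
```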
 
378
 
379
  <h3 id="2-2">2.2 Pretraining model weight acquisition and restoration</h3>
380
 
381
+ ❗❗❗ Note the hardware requirements: step `2.2`, which merges LLaMA-13B with ZhiXi-13B-Diff, requires approximately **100GB** of RAM but no VRAM (this is due to the memory overhead of our merging strategy; we will improve the merging approach in future updates, and a 7B model is also under development, so stay tuned). Step `2.4`, inference with `ZhiXi`, requires at least **26GB** of VRAM.
382
 
383
+ **1. Download LLaMA 13B and ZhiXi-13B-Diff**
384
 
385
  Please click [here](https://forms.gle/jk851eBVbX1m5TAv5) to apply for the official pre-training weights of LLaMA from `meta`. In this case, we are using the `13B` version of the model, so you only need to download the `13B` version. Once downloaded, the file directory will be as follows:
386
 
 
395
  |-- tokenizer_checklist.chk
396
  ```
397
 
398
+ You can use the following command to download the `ZhiXi-13B-Diff` file (assuming it is saved in the `./zhixi-diff` folder):
399
+ ```shell
400
+ python tools/download.py --download_path ./zhixi-diff --only_base
401
+ ```
402
+
403
+ If you want to download the diff weights in the fp16 format, please use the following command (assuming it is saved in the `./zhixi-diff-fp16` folder):
404
  ```shell
405
+ python tools/download.py --download_path ./zhixi-diff-fp16 --only_base --fp16
406
  ```
407
+
408
  > :exclamation: Note: if the download is interrupted, simply repeat the command above. Hugging Face supports resumable downloads, so the download will continue from where it left off.
409
 
410
  **2. Use the conversion script provided by Hugging Face**
 
415
  python convert_llama_weights_to_hf.py --input_dir ./ --model_size 13B --output_dir ./converted
416
  ```
417
 
418
+ **3. Restore ZhiXi 13B**
419
+
420
+ Using the script we provide at `./tools/weight_diff.py`, execute the following command to obtain the complete `ZhiXi` weights:
421
 
422
+ ```shell
423
+ python tools/weight_diff.py recover --path_raw ./converted --path_diff ./zhixi-diff --path_tuned ./zhixi
424
+ ```
425
 
426
+ The final complete ZhiXi weights are saved in the `./zhixi` folder.
427
+
428
+ If you downloaded the diff weights in fp16 format, you can merge them with the following command. Note that the result may differ slightly from the weights obtained with the fp32 version:
429
  ```shell
430
+ python tools/weight_diff.py recover --path_raw ./converted --path_diff ./zhixi-diff-fp16 --path_tuned ./zhixi
431
  ```
432
 
433
+ > ❗NOTE: We do not provide an MD5 checksum for verifying the merged `ZhiXi-13B` weights because they are split into six files. Instead, we employ the same validation strategy as [Stanford Alpaca](https://github.com/tatsu-lab/stanford_alpaca), which performs a sum check on the weights (see this [link](https://github.com/zjunlp/KnowLLM/blob/main/tools/weight_diff.py#L108)). **If the merge completes without any errors, you have obtained the correct pre-trained model.**
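For readers who want to understand what the recovery step does, the sketch below illustrates the general idea, assuming the same diff-plus-sum-check approach as Stanford Alpaca's `weight_diff.py`. The function is illustrative only; `./tools/weight_diff.py` remains the authoritative implementation.

```python
# Conceptual sketch: tuned = raw + diff, applied tensor by tensor, followed by a sum
# over all parameters in the spirit of the sum check mentioned above. The real script
# may differ in details (dtype handling, the expected checksum, weight sharding).
import torch
from transformers import LlamaForCausalLM

def recover(path_raw: str, path_diff: str, path_tuned: str) -> None:
    base = LlamaForCausalLM.from_pretrained(path_raw, torch_dtype=torch.float32)
    merged = LlamaForCausalLM.from_pretrained(path_diff, torch_dtype=torch.float32)

    base_params = dict(base.named_parameters())
    with torch.no_grad():
        for name, param in merged.named_parameters():
            param += base_params[name]  # add the base weights onto the released diff

        total = sum(p.sum().item() for p in merged.parameters())
        print(f"parameter sum after merge: {total:.4f}")  # compare with the published value

    merged.save_pretrained(path_tuned)

# Example, using the paths from the commands above:
# recover("./converted", "./zhixi-diff", "./zhixi")
```

Holding two full-precision 13B models in memory at once is consistent with the ~100GB RAM requirement noted at the top of this section.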
434
 
 
435
 
436
  <h3 id="2-3">2.3 Instruction tuning LoRA weight acquisition</h3>
437
 
 
449
 
450
  **1. Reproduce the results in Section 1**
451
 
452
+ > The cases in `Section 1` were all run on a V100 GPU. If you run them on other devices, the results may vary; try running them multiple times or changing the decoding parameters.
453
+
454
+ 1. If you want to reproduce the results in section `1.1` (**pretraining cases**), please run the following command (assuming that the complete pre-trained weights of `ZhiXi` have been obtained according to the steps in section `2.2` and are saved in the `./zhixi` folder):
455
 
456
  ```shell
457
+ python examples/generate_finetune.py --base_model ./zhixi
458
  ```
459
 
460
  The result in section `1.1` can be obtained.
461
 
462
+ 2. If you want to reproduce the results in section `1.2` (**information extraction cases**), please run the following command (assuming that the LoRA weights of `ZhiXi` have been obtained according to the steps in section `2.3` and are saved in the `./lora` folder):
463
 
464
  ```shell
465
+ python examples/generate_lora.py --load_8bit --base_model ./zhixi --lora_weights ./lora --run_ie_cases
466
  ```
467
 
468
  The result in section `1.2` can be obtained.
469
 
470
+ 3. If you want to reproduce the results in section `1.3` (**general abilities cases**), please run the following command (assuming that the LoRA weights of `ZhiXi` have been obtained according to the steps in section `2.3` and are saved in the `./lora` folder):
471
 
472
  ```shell
473
+ python examples/generate_lora.py --load_8bit --base_model ./zhixi --lora_weights ./lora --run_general_cases
474
  ```
475
 
476
  The result in section `1.3` can be obtained.
 
484
  1. Use the following command to enter **command-line interaction**:
485
 
486
  ```shell
487
+ python examples/generate_finetune.py --base_model ./zhixi --interactive
488
  ```
489
 
490
  The disadvantage is the inability to dynamically change decoding parameters.
 
492
  2. Use the following command to enter **web-based interaction**:
493
 
494
  ```shell
495
+ python examples/generate_finetune_web.py --base_model ./zhixi
496
  ```
497
  Here is a screenshot of the web-based interaction:
498
  <p align="center" width="100%">
499
+ <a href="" target="_blank"><img src="https://github.com/zjunlp/KnowLM/blob/main/assets/finetune_web.jpg?raw=true" alt="finetune-web" style="width: 100%; min-width: 100px; display: block; margin: auto;"></a>
500
  </p>
501
 
502
+
503
  **3. Usage of the Instruction-tuned Model**
504
 
505
  Here, we provide a web-based interaction method. Use the following command to access the web:
506
 
507
  ```shell
508
+ python examples/generate_lora_web.py --base_model ./zhixi --lora_weights ./lora
509
  ```
510
 
511
  Here is a screenshot of the web-based interaction:
512
  <p align="center" width="100%">
513
+ <a href="" target="_blank"><img src="https://github.com/zjunlp/KnowLM/blob/main/assets/lora_web.png?raw=true" alt="finetune-web" style="width: 100%; min-width: 100px; display: block; margin: auto;"></a>
514
  </p>
515
 
516
  The `instruction` is a required parameter, while `input` is an optional parameter. For general tasks (such as the examples provided in section `1.3`), you can directly enter the input in the `instruction` field. For information extraction tasks (as shown in the example in section `1.2`), please enter the instruction in the `instruction` field and the sentence to be extracted in the `input` field. We provide an information extraction prompt in section `2.5`.
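For programmatic use, the sketch below shows one way the `instruction` and optional `input` fields can be combined and sent to the model, assuming the merged base weights are in `./zhixi` and the LoRA weights in `./lora`. The prompt template follows the common Alpaca format; the template actually used in `examples/generate_lora.py` may differ, so treat this as illustrative rather than authoritative.

```python
# Illustrative sketch: build an instruction/input prompt and generate with base + LoRA.
import torch
from transformers import LlamaForCausalLM, LlamaTokenizer
from peft import PeftModel

def build_prompt(instruction: str, input_text: str = "") -> str:
    # For general tasks, put the query in `instruction`; for IE tasks, put the task
    # description in `instruction` and the sentence to extract from in `input_text`.
    if input_text:
        return (
            "Below is an instruction that describes a task, paired with an input that "
            "provides further context. Write a response that appropriately completes "
            f"the request.\n\n### Instruction:\n{instruction}\n\n### Input:\n{input_text}\n\n### Response:\n"
        )
    return (
        "Below is an instruction that describes a task. Write a response that "
        f"appropriately completes the request.\n\n### Instruction:\n{instruction}\n\n### Response:\n"
    )

tokenizer = LlamaTokenizer.from_pretrained("./zhixi")
base = LlamaForCausalLM.from_pretrained("./zhixi", torch_dtype=torch.float16, device_map="auto")
model = PeftModel.from_pretrained(base, "./lora")  # attach the LoRA adapter
model.eval()

prompt = build_prompt("请列出以下句子中出现的人名。", "我和张三一起去北京见李四。")
inputs = tokenizer(prompt, return_tensors="pt").to(base.device)
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=256, do_sample=True,
                            temperature=0.2, top_p=0.9)
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```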
 
521
 
522
  <h3 id="2-5">2.5 Information Extraction Prompt</h3>
523
 
524
+ For information extraction tasks such as named entity recognition (NER), event extraction (EE), and relation extraction (RE), we provide some prompts for ease of use. You can refer to this [link](https://github.com/zjunlp/KnowLM/blob/main/examples/ie_prompt.py) for examples. Of course, you can also try using your own prompts.
525
 
526
+ Here is a [case](https://github.com/zjunlp/DeepKE/blob/main/example/llm/InstructKGC/README.md) where ZhiXi-13B-LoRA is used to accomplish the instruction-based knowledge graph construction task in CCKS2023.
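As a quick illustration, hypothetical prompts in this style might look like the following; the exact wording in `examples/ie_prompt.py` may differ, so use these only as a starting point.

```python
# Hypothetical instruction strings for NER and RE; adapt the entity/relation schemas
# to your own data before use.
ner_instruction = (
    "You are an expert in named entity recognition. Extract all entities of the types "
    "[Person, Location, Organization] from the input and return them as (entity, type) pairs."
)
ner_input = "Zhejiang University is located in Hangzhou, and Professor Zhang works there."

re_instruction = (
    "You are an expert in relation extraction. Given the candidate relations "
    "[located in, works for], extract all (head, relation, tail) triples from the input."
)
re_input = "Professor Zhang works for Zhejiang University, which is located in Hangzhou."
```

Pass the instruction through the `instruction` field and the sentence through the `input` field, as described in section `2.4`.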
527
 
528
 
529
  <h2 id="3">3. Training Details</h2>
 
534
  >
535
  > (2) Instruction tuning stage using LoRA. This stage enables the model to understand human instructions and generate appropriate responses.
536
 
537
+ ![](https://github.com/zjunlp/KnowLM/blob/main/assets/main_new.jpg?raw=true)
538
 
539
  <h3 id="3-1">3.1 Dataset Construction (Pretraining)</h3>
540
 
 
544
 
545
  <h3 id="3-2">3.2 Training Process (Pretraining)</h3>
546
 
547
+ Detailed data processing code, training code, complete training scripts, and detailed training results can be found in [./pretrain](https://github.com/zjunlp/KnowLM/blob/main/pretrain).
548
 
549
  Before training, we need to tokenize the data. We set the maximum length of a single sample to `1024`, while most documents are much longer than this. Therefore, we need to partition these documents. **We designed a greedy algorithm to split the documents, with the goal of ensuring that each sample consists of complete sentences and minimizing the number of segments while maximizing the length of each sample.** Additionally, due to the diversity of data sources, we developed a comprehensive data preprocessing tool that can process and merge data from various sources. Finally, considering the large amount of data, loading it directly into memory would impose excessive hardware pressure. Therefore, we referred to [DeepSpeed-Megatron](https://github.com/bigscience-workshop/Megatron-DeepSpeed/tree/main/tools) and used the `mmap` method to process and load the data. This involves loading the indices into memory and accessing the corresponding data on disk when needed.
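As an illustration of the splitting strategy described above (not the actual implementation in `./pretrain`), a minimal greedy splitter might look like this:

```python
# Simplified sketch of the greedy splitting idea: pack whole sentences into chunks of at
# most `max_len` tokens, starting a new chunk only when the next sentence no longer fits.
# A sentence longer than `max_len` still becomes its own (oversized) chunk here.
import re
from typing import Callable, List

def greedy_split(document: str, max_len: int, count_tokens: Callable[[str], int]) -> List[str]:
    # Split after Chinese or English sentence-ending punctuation, keeping sentences intact.
    sentences = [s for s in re.split(r"(?<=[。！？.!?])", document) if s.strip()]
    chunks, current, current_len = [], [], 0
    for sentence in sentences:
        n = count_tokens(sentence)
        if current and current_len + n > max_len:
            chunks.append("".join(current))
            current, current_len = [], 0
        current.append(sentence)
        current_len += n
    if current:
        chunks.append("".join(current))
    return chunks

# Example with a whitespace "tokenizer" standing in for the real one:
print(greedy_split("This is one sentence. Here is another. And a third.",
                   max_len=8, count_tokens=lambda s: len(s.split())))
```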
550
 
 
576
  | Information Extraction Datasets (English) | 537429 |
577
  | Information Extraction Datasets (Chinese) | 486768 |
578
 
579
+ **Flow diagram of KG2Instruction and other instruction fine-tuning datasets**
580
+ <p align="center" width="100%">
581
+ <a href="" target="_blank"><img src="https://github.com/zjunlp/KnowLM/blob/main/assets/kg2instructions-en.png?raw=true" style="width: 90%; min-width: 90px; display: block; margin: auto;"></a>
582
+ </p>
583
 
584
  <h3 id="3-4">3.4 Training Process (Instruction tuning)</h3>
585
 
 
669
 
670
  - [Vicuna](https://vicuna.lmsys.org/)
671
 
672
+ - [Llama-X](https://github.com/AetherCortex/Llama-X)