Update README.md #1
by izhx - opened

- README.md +81 -3
- README_zh.md +76 -0
README.md
CHANGED
@@ -1,3 +1,81 @@
---
license: apache-2.0
---

**English** | [中文](./README_zh.md)

<!-- [Arxiv PDF](https://arxiv.org/pdf/2407.19669), [HF paper page](https://huggingface.co/papers/2407.19669) -->

## Code implementation of Qwen2-based embeddings

This model code is for Qwen2-based embedding models.

We enable bidirectional attention by default.

### Usage

1. Download `configuration.py` and `modeling.py` from this repository into your local `gte-Qwen2` model directory.
2. Replace every `modeling_qwen.` with `modeling.` in the `auto_map` field of `config.json` (a sketch of this edit follows the list).
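
For step 2, here is a minimal sketch that automates the edit. The model directory path is hypothetical, and the exact `auto_map` entries depend on your local `config.json`:

```python
import json

# Hypothetical path to your local gte-Qwen2 model directory.
config_path = 'gte-Qwen2-1.5B-instruct/config.json'

with open(config_path) as f:
    config = json.load(f)

def retarget(ref):
    # auto_map values are usually "file.Class" strings, but some repos
    # store lists; handle both just in case.
    if isinstance(ref, str):
        return ref.replace('modeling_qwen.', 'modeling.')
    return [retarget(r) if isinstance(r, str) else r for r in ref]

# e.g. "modeling_qwen.Qwen2Model" -> "modeling.Qwen2Model" (class name illustrative)
config['auto_map'] = {name: retarget(ref) for name, ref in config['auto_map'].items()}

with open(config_path, 'w') as f:
    json.dump(config, f, indent=2)
```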

### Recommendation: Enable Unpadding and Acceleration with `xformers`

This code supports accelerating attention computation with `xformers`, which automatically chooses the optimal implementation for the device at hand, such as `flash_attn`. This also brings significant acceleration on older devices that `flash_attn` itself does not support, such as the V100.

First, install `xformers` (with `pytorch` pre-installed):

```
# if pytorch was installed via conda:
conda install xformers -c xformers

# if pytorch was installed via pip:
# cuda 11.8 version
pip3 install -U xformers --index-url https://download.pytorch.org/whl/cu118
# cuda 12.1 version
pip3 install -U xformers --index-url https://download.pytorch.org/whl/cu121
```

For more information, refer to [Installing xformers](https://github.com/facebookresearch/xformers?tab=readme-ov-file#installing-xformers).
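
To check which attention backends your installed build actually provides, `xformers` ships a diagnostic entry point:

```
python -m xformers.info
```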

Then, when loading the model, set `unpad_inputs` and `use_memory_efficient_attention` to `true`, and set `torch_dtype` to `torch.float16` (or `torch.bfloat16`) to achieve the acceleration.

```python
import torch
from transformers import AutoModel, AutoTokenizer

path = 'Alibaba-NLP/gte-Qwen2-1.5B-instruct'
device = torch.device('cuda')
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True)
model = AutoModel.from_pretrained(
    path,
    trust_remote_code=True,
    unpad_inputs=True,
    use_memory_efficient_attention=True,
    torch_dtype=torch.float16
).to(device)

inputs = tokenizer(['test input'], truncation=True, max_length=8192, padding=True, return_tensors='pt')

with torch.inference_mode():
    outputs = model(**inputs.to(device))
```
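
The snippet above stops at the raw forward pass. As a continuation, here is a minimal pooling sketch; it assumes the last-token pooling commonly used with `gte-Qwen2` embedding models and right-padded inputs, so treat it as illustrative rather than this repository's official recipe:

```python
import torch.nn.functional as F

# ASSUMPTION: last-token pooling; check the model card for the
# authoritative recipe. Picks each sequence's final non-padding
# hidden state, then L2-normalizes for cosine similarity.
last_hidden = outputs.last_hidden_state               # (batch, seq_len, dim)
last_token = inputs['attention_mask'].sum(dim=1) - 1  # index of last real token
batch_idx = torch.arange(last_hidden.size(0), device=last_hidden.device)
embeddings = F.normalize(last_hidden[batch_idx, last_token], p=2, dim=1)
```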

Alternatively, you can set `unpad_inputs` and `use_memory_efficient_attention` to `true` directly in the model's `config.json`, eliminating the need to set them in the code.
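
With that approach, the two added keys would look like this (a sketch of only the relevant excerpt, not a complete `config.json`):

```
{
  "unpad_inputs": true,
  "use_memory_efficient_attention": true
}
```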

## Citation

```
@misc{zhang2024mgte,
  title={mGTE: Generalized Long-Context Text Representation and Reranking Models for Multilingual Text Retrieval},
  author={Xin Zhang and Yanzhao Zhang and Dingkun Long and Wen Xie and Ziqi Dai and Jialong Tang and Huan Lin and Baosong Yang and Pengjun Xie and Fei Huang and Meishan Zhang and Wenjie Li and Min Zhang},
  year={2024},
  eprint={2407.19669},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2407.19669},
}
```

README_zh.md
ADDED
@@ -0,0 +1,76 @@
---
license: apache-2.0
---

[English](./README.md) | **中文**

<!-- [Arxiv PDF](https://arxiv.org/pdf/2407.19669), [HF paper page](https://huggingface.co/papers/2407.19669) -->

## Code implementation of Qwen2-based embeddings

This model code is for text embedding models based on `Qwen2`.

Bidirectional attention is enabled by default.

### Usage

1. Download `configuration.py` and `modeling.py` from this repository into your local `gte-Qwen2` model directory.
2. Replace every `modeling_qwen.` under the `auto_map` field of `config.json` with `modeling.`.

### Recommendation: Enable Unpadding and Acceleration with `xformers`

This code supports accelerating attention computation with `xformers`, which can automatically pick an optimized implementation for the device type, such as `flash_attn`. With `xformers`, you can also obtain a large speedup on older devices that cannot support `flash_attn`, such as the V100.

First, install `xformers` (with `pytorch` pre-installed):

```
# if pytorch was installed via conda:
conda install xformers -c xformers

# if pytorch was installed via pip:
# cuda 11.8 version
pip3 install -U xformers --index-url https://download.pytorch.org/whl/cu118
# cuda 12.1 version
pip3 install -U xformers --index-url https://download.pytorch.org/whl/cu121
```

For more information, see [installing-xformers](https://github.com/facebookresearch/xformers?tab=readme-ov-file#installing-xformers).

Then, when loading the model, set `unpad_inputs` and `use_memory_efficient_attention` to `true`, and set `torch_dtype` to `torch.float16` (or `torch.bfloat16`) to get the acceleration.

```python
import torch
from transformers import AutoModel, AutoTokenizer

path = 'Alibaba-NLP/gte-Qwen2-1.5B-instruct'
device = torch.device('cuda')
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True)
model = AutoModel.from_pretrained(
    path,
    trust_remote_code=True,
    unpad_inputs=True,
    use_memory_efficient_attention=True,
    torch_dtype=torch.float16
).to(device)

inputs = tokenizer(['test input'], truncation=True, max_length=8192, padding=True, return_tensors='pt')

with torch.inference_mode():
    outputs = model(**inputs.to(device))
```
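
As in the English README, the forward pass returns token-level hidden states only. A small continuation sketch, assuming the last-token pooling typically used with `gte-Qwen2` models (an assumption; check the model card for the official recipe), followed by cosine-similarity scoring:

```python
import torch.nn.functional as F

# ASSUMPTION: last-token pooling, i.e. each sequence's final
# non-padding hidden state, L2-normalized.
mask = inputs['attention_mask']
last_idx = mask.sum(dim=1) - 1
batch_idx = torch.arange(mask.size(0), device=mask.device)
embeddings = F.normalize(outputs.last_hidden_state[batch_idx, last_idx], p=2, dim=1)

# Cosine similarities between all encoded inputs (a 1x1 matrix here,
# since the example encodes a single sentence).
scores = embeddings @ embeddings.T
print(scores)
```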

Alternatively, you can set `unpad_inputs` and `use_memory_efficient_attention` to `true` directly in the model's `config.json`, which saves setting them in code.

## Citation

```
@misc{zhang2024mgte,
  title={mGTE: Generalized Long-Context Text Representation and Reranking Models for Multilingual Text Retrieval},
  author={Xin Zhang and Yanzhao Zhang and Dingkun Long and Wen Xie and Ziqi Dai and Jialong Tang and Huan Lin and Baosong Yang and Pengjun Xie and Fei Huang and Meishan Zhang and Wenjie Li and Min Zhang},
  year={2024},
  eprint={2407.19669},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2407.19669},
}
```