Modified version of [xlm-roberta-flash-implementation](https://huggingface.co/jinaai/xlm-roberta-flash-implementation) for the ONNX conversion.

## Brief Summary of Challenges and Modifications

### Dynamic Matrix Calculation in RoPE
The original RoPE implementation did not compute the entire rotation matrix up front. Instead, it computed the matrix only for the required sequence length, cached it, and recomputed it whenever a longer sequence arrived as input. This approach isn't compatible with ONNX, which requires a fixed computation graph at inference time. To solve this, I now compute the entire rotation matrix, up to the maximum sequence length, in advance.
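
A minimal sketch of the idea, assuming the usual cos/sin formulation of RoPE; the class and argument names (`FixedRotaryEmbedding`, `max_position_embeddings`) are illustrative, not this repository's actual code:

```python
import torch


class FixedRotaryEmbedding(torch.nn.Module):
    """Precompute the full RoPE cos/sin tables once at init time, so the
    exported ONNX graph contains fixed constants instead of a cache that
    grows with the input sequence length."""

    def __init__(self, dim: int, max_position_embeddings: int = 8192, base: float = 10000.0):
        super().__init__()
        inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
        positions = torch.arange(max_position_embeddings).float()
        freqs = torch.outer(positions, inv_freq)  # (max_len, dim / 2)
        emb = torch.cat((freqs, freqs), dim=-1)   # (max_len, dim)
        # Buffers end up as initializers in the exported graph.
        self.register_buffer("cos_cached", emb.cos(), persistent=False)
        self.register_buffer("sin_cached", emb.sin(), persistent=False)

    def forward(self, seq_len: int):
        # Slicing a fixed buffer is ONNX-friendly; nothing is recomputed
        # or re-cached inside the traced graph.
        return self.cos_cached[:seq_len], self.sin_cached[:seq_len]
```
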
### Custom Backward Functions for RoPE
We have custom forward and backward functions for RoPE. ONNX does not support custom backward functions, but since ONNX inference only ever runs the forward pass, I removed the backward function completely.
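
For illustration, once the `torch.autograd.Function` wrapper (and its backward) is gone, the rotary application can be a plain function that the exporter traces directly; `rotate_half` and `apply_rotary_inference` are illustrative names, not the repository's actual functions:

```python
import torch


def rotate_half(x: torch.Tensor) -> torch.Tensor:
    # Standard RoPE helper: swap and negate the two halves of the last dim.
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat((-x2, x1), dim=-1)


def apply_rotary_inference(x: torch.Tensor, cos: torch.Tensor, sin: torch.Tensor) -> torch.Tensor:
    # Forward-only rotary embedding: no torch.autograd.Function subclass,
    # no custom backward, so torch.onnx.export can trace it as-is.
    return (x * cos) + (rotate_half(x) * sin)
```
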
### ONNX Model Size Limitation
ONNX stores the model in protobuf format, which has a hard maximum size of 2 GB. Our model was too large to fit within this limit, so I had to store the model's parameters in external data files instead.
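
A sketch of the mechanism the `onnx` package exposes for this; the file names below are illustrative:

```python
import onnx

# Re-save the exported model with all large tensors moved into a sidecar
# file, so the .onnx protobuf itself stays under the 2 GB limit.
model = onnx.load("exported/model.onnx")  # illustrative path
onnx.save_model(
    model,
    "exported/model.onnx",
    save_as_external_data=True,
    all_tensors_to_one_file=True,
    location="model.onnx_data",  # weights file, relative to the .onnx file
    size_threshold=1024,         # externalize tensors larger than ~1 KB
)
```
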
### Lack of Support for the `unique()` Function
We used the `unique()` function to identify the distinct task types present in a batch, which matters when a single batch mixes several task types. However, ONNX does not support `unique()`. Since mixing task types within a batch is not needed for inference, I replaced the `adapter_mask` argument (a tensor specifying an independent task ID for each text in the batch) with a single integer `task_id` argument that applies to every text in the batch.
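
A simplified sketch of the difference, using a toy LoRA-style adapter; `TaskAdapterLinear` and its parameters are hypothetical, not this repository's classes:

```python
import torch


class TaskAdapterLinear(torch.nn.Module):
    """Toy LoRA-style layer with one low-rank adapter per task.

    With a per-sample `adapter_mask`, picking the right adapters requires
    `torch.unique()` over the mask, which ONNX cannot export. A single
    integer `task_id` for the whole batch reduces this to plain indexing."""

    def __init__(self, in_features: int, out_features: int, num_tasks: int, rank: int = 8):
        super().__init__()
        self.weight = torch.nn.Parameter(torch.randn(out_features, in_features))
        self.lora_a = torch.nn.Parameter(torch.randn(num_tasks, rank, in_features))
        self.lora_b = torch.nn.Parameter(torch.randn(num_tasks, out_features, rank))

    def forward(self, x: torch.Tensor, task_id: int) -> torch.Tensor:
        # Indexing with a plain integer exports cleanly; no unique() needed.
        a = self.lora_a[task_id]  # (rank, in_features)
        b = self.lora_b[task_id]  # (out_features, rank)
        return x @ self.weight.T + (x @ a.T) @ b.T
```
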