add link to doc2query paper
README.md
CHANGED
@@ -32,6 +32,8 @@ A [Mistral-7B-v0.1](https://huggingface.co/mistralai/Mistral-7B-v0.1) fine-tuned
* synthetic query generation for downstream embedding fine-tuning tasks - when you have only documents and no queries/labels. Such a task can be done with the [nixietune](https://github.com/nixiesearch/nixietune) toolkit; see the `nixietune.qgen.generate` recipe.
* synthetic dataset expansion for further embedding training - when you DO have query-document pairs, but only a few. You can fine-tune `nixie-querygen-v2` on the existing pairs and then expand your document corpus with synthetic queries (which are still based on your few real ones). See the `nixietune.qgen.train` recipe.

+The idea behind the approach is taken from the [docT5query](https://github.com/castorini/docTTTTTquery) model. See the original paper: [Rodrigo Nogueira and Jimmy Lin. From doc2query to docTTTTTquery.](https://cs.uwaterloo.ca/~jimmylin/publications/Nogueira_Lin_2019_docTTTTTquery-v2.pdf)
+
## Training data

We used [200k query-document pairs](https://huggingface.co/datasets/nixiesearch/query-positive-pairs-small) sampled randomly from a diverse set of IR datasets: