Spaces: commit by avacaondata
Commit 851edbd (parent: 93a0690): "añadidos cambios al article" (added changes to the article)
Browse files: article_app.py (+7 -5)

article_app.py: CHANGED
@@ -1,6 +1,5 @@
 article = """
 <img src="https://www.iic.uam.es/wp-content/uploads/2017/12/IIC_logoP.png">
-<img src="https://drive.google.com/uc?export=view&id=1_iUdUMPR5u1p9767YVRbCZkobt_fOozD">
 
 <p style="text-align: justify;"> This app is developed by the aforementioned members of <a href="https://www.iic.uam.es/">IIC - Instituto de Ingeniería del Conocimiento</a> as part of the <a href="https://somosnlp.org/hackathon">Somos PLN Hackathon 2022.</a>
 
@@ -16,7 +15,7 @@ a very good impact on society. Health is a hot topic today but should be always
 We identified the need for strong intelligent information retrieval systems. Imagine a Siri that could generate coherent answers to your questions, instead of simply running a Google search for you. That is the technology we envision, and to which we would like the Spanish
 NLP community to get a little step closer.
 
-The main technical objective of this app is to expand the existing tools for long-form question answering in Spanish by introducing new generative methods together with a complete architecture of well-performing models, producing interesting results in a variety of examples tried
+The main technical objective of this app is to expand the existing tools for long-form question answering in Spanish by introducing new generative methods together with a complete architecture of well-performing models, producing interesting results in a variety of examples tried.
 In fact, multiple novel methods in Spanish have been introduced to build this app.
 
 Most of these systems currently rely on Sentence Transformers for passage retrieval (which we wanted to improve by creating Dense Passage Retrieval in Spanish) and use extractive question answering methods. This means that the user needs to look
@@ -30,7 +29,9 @@ but also for building a Dense Passage Retrieval (DPR) dataset to train a DPR mod
 
 The fragility of the solution we devised, and therefore also its most beautiful side when it works, is that every piece must work perfectly for the final answer to be correct. If our Speech2Text system is not
 good enough, the transcribed text will reach the DPR corrupted, so no relevant documents will be retrieved and the answer will be poor. Similarly, if the DPR is not correctly trained and cannot identify the relevant passages for a query, the result will be bad.
-This also served as a motivation, as the technical difficulty was completely worth it in case it worked.
+This also served as a motivation, as the technical difficulty was completely worth it in case it worked. Moreover, it would also be a service to the Spanish NLP community, since to build this app we would apply much of what we learned in the private sector about building well-performing systems
+that rely on multiple models, delivering top-performing models for question answering related tasks to the community and thus taking part in the Open Source culture and the expansion of knowledge. Another objective we had, then, was to give a practical example of good practices,
+which fits the didactic character of both the organization and the Hackathon.
 
 Regarding Speech2Text, there were existing solutions trained on Common Voice; however, there were no Spanish models trained on big datasets like MultiLibrispeech-es, which we used following the results reported in Meta's paper (more info in the linked wav2vec2 model above). We also decided
 to train the large version of wav2vec2, as the other available ASR models were 300M-parameter models, so we also wanted to improve on that front, not only on the dataset used. We obtained a WER of 0.073, which is arguably low compared to the rest of the existing models on ASR
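Note: the Speech2Text step discussed in this hunk can be exercised with a few lines of code. This is a minimal sketch, assuming the linked checkpoint works with the standard transformers ASR pipeline; the audio file name is a hypothetical placeholder.

# Minimal sketch of the Speech2Text step.
# Assumption: the checkpoint is compatible with the generic ASR pipeline,
# and "question.wav" is a placeholder for a user recording.
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="IIC/wav2vec2-spanish-multilibrispeech",
)
transcription = asr("question.wav")["text"]
print(transcription)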
@@ -39,9 +40,8 @@ datasets in Spanish. Further research should be made to compare all of these mod
 Another contribution we wanted to make with this project was a well-performing ranker in Spanish. This is a piece we place after the DPR to rank the retrieved passages by relevance to the query and keep the top ones. Although there are multilingual open-source solutions, there are no Spanish monolingual models in this regard.
 For that, we trained a CrossEncoder, for which we automatically translated <a href="https://microsoft.github.io/msmarco/">MS Marco</a> with a Transformer; it contains around 200k query-passage pairs if we keep the 1-positive-to-4-negatives ratio used in the papers. MS Marco is the dataset typically used in English to train cross-encoders for ranking.
 
-
 Finally, there are practically no generative question answering datasets in Spanish. For that reason, we used LFQA, as mentioned above. It has over 400k data instances, which we also translated with Transformers.
-Our translation methods needed to work correctly, since the passages were too long for the max sequence length of the translation model and there were
+Our translation methods needed to work correctly, since the passages were too long for the max sequence length of the translation model and there were 400k x 3 (answer, question, passage) texts to translate.
 We solved those problems with intelligent text splitting and reconstruction, and an efficient configuration of the translation process. Thanks to this dataset we could train 2 generative models, for which we used our expertise on generative language models to train them effectively.
 The reason for including audio as a possible input and output is that we wanted to make the App much more accessible to everyone. With this App we want to put biomedical knowledge in Spanish within everyone's reach.
 
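Note: the ranking step described in the hunk above (a CrossEncoder applied after the DPR) could look roughly like this with the sentence-transformers library. The model identifier is a hypothetical placeholder, since the Spanish ranker's hub ID is not shown in this diff.

# Sketch of re-ranking retrieved passages with a CrossEncoder.
# The model ID below is a placeholder, not an actual published checkpoint.
from sentence_transformers import CrossEncoder

ranker = CrossEncoder("path-or-hub-id-of-the-spanish-crossencoder")

query = "¿Para qué se usa el ibuprofeno?"
passages = [
    "El ibuprofeno es un antiinflamatorio no esteroideo...",
    "La amoxicilina es un antibiótico de amplio espectro...",
]

# Score each (query, passage) pair and sort passages by relevance.
scores = ranker.predict([(query, p) for p in passages])
ranked = sorted(zip(passages, scores), key=lambda x: x[1], reverse=True)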
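Note: the "intelligent text splitting and reconstruction" used to translate MS Marco and LFQA is not part of this file. A rough sketch of the idea follows; the sentence-based split and the Helsinki-NLP checkpoint are assumptions, not the exact setup used by the authors.

# Sketch: translate passages longer than the model's max sequence length by
# splitting them into chunks, translating each chunk, and rejoining the output.
# The opus-mt checkpoint and the naive sentence split are assumptions.
from transformers import pipeline

translator = pipeline("translation", model="Helsinki-NLP/opus-mt-en-es")

def translate_long(text: str, max_chars: int = 400) -> str:
    sentences = [s if s.endswith(".") else s + "." for s in text.split(". ")]
    chunks, current = [], ""
    for sentence in sentences:
        if current and len(current) + len(sentence) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    translated = translator(chunks, max_length=512)
    return " ".join(piece["translation_text"] for piece in translated)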
@@ -50,6 +50,8 @@ System Architecture
 </h3>
 Below you can find all the pieces that form the system. This section is minimalist so that the user can get a broad view of the general inner workings of the app, and can then explore each model and dataset, where they will find much more information on each piece of the system.
 
+<img src="https://drive.google.com/uc?export=view&id=1_iUdUMPR5u1p9767YVRbCZkobt_fOozD">
+
 <ol>
 <li><a href="https://hf.co/IIC/wav2vec2-spanish-multilibrispeech">Speech2Text</a>: For this we fine-tuned a multilingual Wav2Vec2, as explained in the attached link. We use this model to process audio questions.</li>
 <li><a href="https://hf.co/IIC/dpr-spanish-passage_encoder-allqa-base">Dense Passage Retrieval (DPR) for Context</a>: Dense Passage Retrieval is a methodology <a href="https://arxiv.org/abs/2004.04906">developed by Facebook</a> which is currently the SoTA for passage retrieval, that is, the task of getting the most relevant passages to answer a given question. You can find details about how it was trained in the link attached to the name.</li>
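Note: as a complement to the list above, this is roughly how the linked passage encoder could be used to embed candidate contexts. It assumes the checkpoint loads with transformers' DPR classes; the matching question encoder (whose hub ID is not shown here) would embed the query analogously, and retrieval reduces to a dot product between the two embeddings.

# Sketch of embedding passages with the linked DPR passage encoder.
# Assumption: the checkpoint is compatible with transformers' DPR classes.
import torch
from transformers import AutoTokenizer, DPRContextEncoder

model_id = "IIC/dpr-spanish-passage_encoder-allqa-base"
tokenizer = AutoTokenizer.from_pretrained(model_id)
encoder = DPRContextEncoder.from_pretrained(model_id)

passages = [
    "El ibuprofeno es un antiinflamatorio no esteroideo...",
    "La amoxicilina es un antibiótico de amplio espectro...",
]
with torch.no_grad():
    inputs = tokenizer(passages, padding=True, truncation=True, return_tensors="pt")
    passage_embeddings = encoder(**inputs).pooler_output  # (n_passages, hidden_size)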