SemanticSearch-HU / approach.txt

endre sukosd

Semantic Search HU implementation

3992084 almost 3 years ago


	Types of Question Answering
	- extractive question answering (encoder-only models, e.g. BERT)
		- pose questions about a document and identify the answers as spans of text in the document itself
	- generative question answering (encoder-decoder models, e.g. T5/BART)
		- open-ended questions whose answers need to synthesize information
	- retrieval-based/community question answering



	First approach - translate dataset, fine-tune model
	!Not really feasible, because it needs a lot of human evaluation to correctly determine the answer start token

	1. Translate an English QA dataset into Hungarian
		- SQuAD - reading comprehension based on Wikipedia articles
		- ~100,000 question/answer pairs
	2. Fine-tune a model and evaluate it on this dataset
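
	Why the answer start token is a problem after translation can be sketched as follows. This is only an illustration, assuming SQuAD-style records with a character offset (answer_start) into the context. The Hungarian sentences are hand-written examples, not real dataset entries, and realign is a hypothetical helper.

```python
def span_is_valid(record):
    """Check that answer_start really points at answer_text in the context."""
    start = record["answer_start"]
    end = start + len(record["answer_text"])
    return record["context"][start:end] == record["answer_text"]


def realign(record):
    """Naive re-alignment: look up the translated answer in the translated context.

    Returns a copy with an updated answer_start, or None when the answer
    string no longer occurs verbatim (such records need human review).
    """
    start = record["context"].find(record["answer_text"])
    if start == -1:
        return None
    return {**record, "answer_start": start}


# English SQuAD-style record: the answer is a character span of the context.
english = {
    "context": "Budapest is the capital of Hungary.",
    "question": "What is the capital of Hungary?",
    "answer_text": "Budapest",
    "answer_start": 0,
}

# Hand-written translation: word order changed, so the copied offset is stale.
translated = {
    "context": "Magyarország fővárosa Budapest.",
    "question": "Mi Magyarország fővárosa?",
    "answer_text": "Budapest",
    "answer_start": 0,
}

# Hand-written translation where inflection changes the surface form:
# "tó" (lake) appears in the context only as "tavon" (on the lake).
inflected = {
    "context": "A gyerekek a tavon korcsolyáztak.",
    "question": "Hol korcsolyáztak a gyerekek?",
    "answer_text": "a tó",
    "answer_start": 0,
}

print(span_is_valid(english))     # offset matches the answer text
print(span_is_valid(translated))  # stale offset after translation
print(realign(translated))        # string search recovers the span
print(realign(inflected))         # no verbatim match, needs a human
```

	The string search repairs the first translated record, but fails on the inflected one, which is the kind of case that would need manual checking.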


	Second approach - fine-tune multilingual model
	!The MQA format differs from SQuAD, so ModelForQuestionAnswering cannot be used

	1. Use a Hungarian dataset
		- MQA - multilingual, parsed from Common Crawl
			- FAQ - 878,385 (2,415 domains)
			- CQA - 27,639 (171 domains)
	2. Fine-tune and evaluate a model on this dataset
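
	The format mismatch can be illustrated with a small sketch. The record layout below (name, answers, text, is_accepted) is an assumption for illustration and should be checked against the actual MQA data. The example strings are placeholders, and has_span_labels and to_pairs are hypothetical helpers.

```python
# Hypothetical MQA-style record: a question with a list of answers.
# There is no context passage and no answer_start offset as in SQuAD.
mqa_record = {
    "name": "How do I change my password?",  # placeholder question text
    "answers": [
        {"text": "Open Settings and choose Change password.", "is_accepted": True},
        {"text": "Contact customer support.", "is_accepted": False},
    ],
}


def has_span_labels(record):
    """Extractive QA heads need a context plus an answer offset to train on."""
    return "context" in record and "answer_start" in record


def to_pairs(record, accepted_only=False):
    """Flatten a record into (question, answer) pairs for retrieval-style use."""
    pairs = []
    for answer in record["answers"]:
        if accepted_only and not answer.get("is_accepted", False):
            continue
        pairs.append((record["name"], answer["text"]))
    return pairs


print(has_span_labels(mqa_record))           # no span labels available
print(to_pairs(mqa_record))                  # every answer as a pair
print(to_pairs(mqa_record, accepted_only=True))
```

	Question/answer pairs like these fit a retrieval or sentence-similarity setup rather than a span-prediction head.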


	Possible steps:
	- Use an existing pre-trained Hungarian, Romanian, or multilingual model to generate embeddings
	- Select a model:
		- multilingual models that include hu:
			- distiluse-base-multilingual-cased-v2 (400MB)
			- paraphrase-multilingual-MiniLM-L12-v2 (400MB) - fastest
			- paraphrase-multilingual-mpnet-base-v2 (900MB) - best performing
		- hubert
	- Select a dataset:
		- use the MQA Hungarian subset
		- use Hungarian Wikipedia page data, split up into paragraphs
		- DBpedia, shortened abstracts = 500,000
	- Pre-compute embeddings for all answers/paragraphs
	- Compute the embedding for the incoming query
	- Compare similarity between the query embedding and the precomputed embeddings
	- Return the top-3 answers/questions
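
	The precompute/query/compare/top-3 flow can be sketched in plain Python. This is a minimal sketch, not the actual implementation: toy_embed is a hypothetical stand-in that counts words, used only so the example runs without downloading a model. In a real system it would be replaced by the encode step of one of the multilingual sentence-transformer models listed above. The passages are made-up English examples.

```python
import math
import re
from collections import Counter


def toy_embed(text):
    """Stand-in for a sentence embedding model: a bag-of-words count vector.

    ASSUMPTION: a real system would call a multilingual sentence-transformer
    here; word counts only keep the sketch self-contained.
    """
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def build_index(passages, embed=toy_embed):
    """Pre-compute one embedding per answer/paragraph."""
    return [(passage, embed(passage)) for passage in passages]


def search(query, index, embed=toy_embed, top_k=3):
    """Embed the query, score it against every stored embedding, keep the best."""
    query_vec = embed(query)
    scored = [(cosine(query_vec, vec), passage) for passage, vec in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


passages = [
    "Budapest is the capital of Hungary.",
    "The Danube flows through Budapest.",
    "Lake Balaton is the largest lake in Central Europe.",
    "Bucharest is the capital of Romania.",
]
index = build_index(passages)  # done once, offline
results = search("What is the capital of Hungary?", index)
for score, passage in results:
    print(f"{score:.3f}  {passage}")
```

	The index is built once and reused for every query, so only the query embedding has to be computed at request time. With a large corpus, the linear scan in search would likely be replaced by a vector index.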

	Alternative steps:
	- Train a sentence transformer on the Hungarian/Romanian subsets
	- Use the trained sentence transformer to generate embeddings
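
	The notes do not say which training objective would be used. One common choice for question/answer pairs is a contrastive loss with in-batch negatives (sentence-transformers provides this as MultipleNegativesRankingLoss). The sketch below computes that loss by hand on tiny made-up vectors, assuming unit-length embeddings so the dot product equals cosine similarity. The function name and the example vectors are illustrative only.

```python
import math


def dot(u, v):
    return sum(x * y for x, y in zip(u, v))


def in_batch_negatives_loss(question_vecs, answer_vecs, scale=20.0):
    """Mean cross-entropy over a batch of (question, answer) pairs.

    For question i, answer i is the positive and every other answer in the
    batch acts as a negative. ASSUMPTION: vectors are unit length.
    """
    losses = []
    for i, q in enumerate(question_vecs):
        logits = [scale * dot(q, a) for a in answer_vecs]
        # Numerically stable log-sum-exp over all candidate answers.
        m = max(logits)
        log_sum_exp = m + math.log(sum(math.exp(l - m) for l in logits))
        losses.append(log_sum_exp - logits[i])
    return sum(losses) / len(losses)


# Made-up 2-d embeddings: each question points the same way as its answer.
questions = [[1.0, 0.0], [0.0, 1.0]]
matching_answers = [[1.0, 0.0], [0.0, 1.0]]
# Same answers in the wrong order: each question now matches the other's answer.
swapped_answers = [[0.0, 1.0], [1.0, 0.0]]

good = in_batch_negatives_loss(questions, matching_answers)
bad = in_batch_negatives_loss(questions, swapped_answers)
print(good < bad)
```

	Training would push the model toward embeddings that make this loss small, which is what makes the later similarity search over question and answer embeddings meaningful.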