Law tab & Google Gecko
Looks like below (still need to remove SONAR & jina, as they're from a different dataset):
Relevant GitHub PRs:
https://github.com/embeddings-benchmark/mteb/pull/311
As detailed in the PR, I think mixing domain & language tabs is just temporary; once there is a significant amount of both, we can split them up into separate tab lines. Maybe we can also let people select them, similar to the nice UI by @tomaarsen in https://huggingface.co/spaces/mteb/leaderboard/discussions/89
That works for me.
This PR looks solid to me.
- Tom Aarsen
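For the selection idea above, here is a rough sketch of what a domain/language selector could look like in Gradio. It is purely illustrative: the toy columns and filtering logic are assumptions, not the leaderboard's actual code.

import gradio as gr
import pandas as pd

# Toy scores table standing in for the leaderboard data.
SCORES = pd.DataFrame({
    "Model": ["model-a", "model-b"],
    "Law": [55.1, 57.3],
    "German": [60.2, 58.9],
})

def filter_columns(selection: str) -> pd.DataFrame:
    # Show only the Model column plus the selected domain/language column.
    return SCORES[["Model", selection]]

with gr.Blocks() as demo:
    selector = gr.Radio(choices=["Law", "German"], value="Law", label="Domain / Language")
    table = gr.Dataframe(value=filter_columns("Law"))
    selector.change(fn=filter_columns, inputs=selector, outputs=table)

demo.launch()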
Also @Shuang59, could you share the instruction you used for e5-mistral-7b-instruct? I'd like to try GritLM-7B on it with the same instruction, which should perform slightly better.
Hi @Muennighoff, I used the same instruction as in the original code at this link:
https://huggingface.co/intfloat/e5-mistral-7b-instruct
def get_detailed_instruct(task_description: str, query: str) -> str:
    return f'Instruct: {task_description}\nQuery: {query}'

task_prompt = 'Given a web search query, retrieve relevant passages that answer the query'
batch = [get_detailed_instruct(task_prompt, q) for q in batch]
if self.engine == 'intfloat/e5-mistral-7b-instruct':
    # Tokenize without padding, append the EOS token to each sequence, then pad the batch.
    all_tokens = self.tokenizer(batch, max_length=self.max_token_len - 1, return_attention_mask=False, padding=False, truncation=True)
    all_tokens['input_ids'] = [input_ids + [self.tokenizer.eos_token_id] for input_ids in all_tokens['input_ids']]
    all_tokens = self.tokenizer.pad(all_tokens, padding=True, return_attention_mask=True, return_tensors='pt')
elif self.engine == 'Salesforce/SFR-Embedding-Mistral':
    # SFR-Embedding-Mistral: padding and truncation handled in a single tokenizer call.
    all_tokens = self.tokenizer(batch, max_length=self.max_token_len, padding=True, truncation=True, return_tensors="pt")
outputs = self.model(**all_tokens)
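Since the snippet stops at the forward pass, here is a minimal sketch of the remaining pooling step, following the last-token pooling shown on the e5-mistral-7b-instruct model card; the outputs and all_tokens names are taken from the snippet above, everything else is illustrative.

import torch
import torch.nn.functional as F
from torch import Tensor

def last_token_pool(last_hidden_states: Tensor, attention_mask: Tensor) -> Tensor:
    # With left padding the embedding sits at the final position; with right
    # padding the last non-pad token is located via the attention mask.
    left_padding = (attention_mask[:, -1].sum() == attention_mask.shape[0])
    if left_padding:
        return last_hidden_states[:, -1]
    sequence_lengths = attention_mask.sum(dim=1) - 1
    batch_size = last_hidden_states.shape[0]
    return last_hidden_states[torch.arange(batch_size, device=last_hidden_states.device), sequence_lengths]

# After outputs = self.model(**all_tokens):
embeddings = last_token_pool(outputs.last_hidden_state, all_tokens['attention_mask'])
embeddings = F.normalize(embeddings, p=2, dim=1)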
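For the GritLM-7B comparison mentioned above, a minimal sketch of how the same task prompt could be passed, assuming the gritlm package and its encode(..., instruction=...) API as described in the GritLM README (not part of this thread):

from gritlm import GritLM

def gritlm_instruction(instruction: str) -> str:
    # GritLM wraps the task description in its own embed template; documents use no instruction.
    return "<|user|>\n" + instruction + "\n<|embed|>\n" if instruction else "<|embed|>\n"

model = GritLM("GritLM/GritLM-7B", torch_dtype="auto")

task_prompt = 'Given a web search query, retrieve relevant passages that answer the query'
queries = ["how much protein should a female eat"]
passages = ["Passages are embedded without any instruction."]

q_reps = model.encode(queries, instruction=gritlm_instruction(task_prompt))
p_reps = model.encode(passages, instruction=gritlm_instruction(""))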