Update README.md
README.md CHANGED
@@ -12,7 +12,9 @@ language:
 
 # Model Info
 
-This is a model that applies LLM2Vec to Swallow. Only the PEFT Adapter is distributed.
+This is a model that applies LLM2Vec to Llama-2. Only the PEFT Adapter is distributed.
+LLM2Vec is fine-tuned on two tasks: MNTP and SimCSE, and this repository contains the results of applying SimCSE after MNTP.
+For the MNTP Adapter, please refer to [this link](https://huggingface.co/uzabase/LLM2Vec-Llama-2-7b-hf-wikipedia-jp-mntp).
 
 ## Model Details
@@ -32,26 +34,70 @@ This is a model that applies LLM2Vec to Swallow. Only the PEFT Adapter is distributed.
 
 ## Usage
 
-- Please see [original LLM2Vec repo](https://huggingface.co/McGill-NLP/LLM2Vec-Llama-2-7b-chat-hf-mntp#usage)
+- Please see [original LLM2Vec repo](https://huggingface.co/McGill-NLP/LLM2Vec-Llama-2-7b-chat-hf-mntp-unsup-simcse#usage)
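For context, the following is a minimal sketch of the two-stage loading pattern described in the linked LLM2Vec model cards: load the MNTP adapter, merge it into the base weights, then apply the SimCSE adapter and encode with the `llm2vec` package. The `simcse_repo` identifier is a placeholder for this repository, and the exact base-model handling may differ from what is shown.

```python
# Minimal sketch, adapted from the loading pattern in the linked LLM2Vec model cards.
# `simcse_repo` is a placeholder; verify the exact repository IDs before use.
import torch
from transformers import AutoTokenizer, AutoModel, AutoConfig
from peft import PeftModel
from llm2vec import LLM2Vec

mntp_repo = "uzabase/LLM2Vec-Llama-2-7b-hf-wikipedia-jp-mntp"  # MNTP adapter linked above
simcse_repo = "<this-repository>"                              # SimCSE adapter (placeholder)

tokenizer = AutoTokenizer.from_pretrained(mntp_repo)
config = AutoConfig.from_pretrained(mntp_repo, trust_remote_code=True)
model = AutoModel.from_pretrained(
    mntp_repo, trust_remote_code=True, config=config, torch_dtype=torch.bfloat16
)

# Merge the MNTP adapter into the base weights, then load the SimCSE adapter on top.
model = PeftModel.from_pretrained(model, mntp_repo)
model = model.merge_and_unload()
model = PeftModel.from_pretrained(model, simcse_repo)

# Wrap with LLM2Vec and encode sentences into embeddings (mean pooling, as trained).
l2v = LLM2Vec(model, tokenizer, pooling_mode="mean", max_length=128)
embeddings = l2v.encode(["今日は良い天気ですね。", "本日の天候は快晴です。"])
```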
 
 ## Training Details
 
 ### Training Data
 
-- [Wikipedia](https://huggingface.co/datasets/wikimedia/wikipedia)
+- SimCSE training corpus built from [Wikipedia](https://huggingface.co/datasets/wikimedia/wikipedia)
+- Script used to build the SimCSE corpus:
```python
import argparse
import random
import re
from pathlib import Path

from datasets import load_dataset
from tqdm import tqdm


def main(args):
    random.seed(args.seed)
    # Sample N articles from the Japanese Wikipedia dump (20231101.ja).
    wiki_ds = load_dataset("wikimedia/wikipedia", "20231101.ja")
    sampled_index = random.sample(range(len(wiki_ds["train"])), args.N)
    sample_wiki = wiki_ds["train"][sampled_index]

    output_texts = []
    for title, text in tqdm(zip(sample_wiki["title"], sample_wiki["text"])):
        output_texts.append(title)
        # Split the article body into sentences on newlines and "。".
        sentences = re.split("[\n。]", text)
        for sentence in sentences:
            # Keep only sentences longer than min_sentence_len characters.
            if len(sentence) > args.min_sentence_len:
                output_texts.append(sentence.strip() + "。")

    # Write one title or sentence per line.
    with args.output_path.open(mode="w") as f:
        for line in output_texts:
            f.write(line)
            f.write("\n")


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--N", default=200000, type=int)
    parser.add_argument("--seed", default=42, type=int)
    parser.add_argument("-o", "--output_path", type=Path)
    parser.add_argument("--min_sentence_len", default=50, type=int)

    args = parser.parse_args()
    main(args)
```
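For reference, the script accepts the arguments defined in its `argparse` block, so a hypothetical invocation (the file name `make_simcse_corpus.py` is an assumed placeholder) would look like `python make_simcse_corpus.py --N 200000 --seed 42 --min_sentence_len 50 -o simcse_corpus.txt`.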
 
 #### Training Hyperparameter
+- simcse_dropout: 0.3
+- bidirectional: true
+- pooling_mode: "mean"
+- remove_unused_columns: false
+- learning_rate: 3e-5
+- loss_scale: 20
+- batch_size: 256
 - gradient_accumulation_steps: 1
-- max_seq_length
-- mask_token_type: "blank"
-- mlm_probability: 0.2
+- max_seq_length: 128
 - lora_r: 16
-- torch_dtype "bfloat16"
-- attn_implementation "flash_attention_2"
+- torch_dtype: "bfloat16"
+- attn_implementation: "flash_attention_2"
+- seed: 42
 - bf16: true
 - gradient_checkpointing: true
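As a rough illustration only, not the training code for this repository, a few of the hyperparameters above map onto a `transformers` + `peft` setup as in the sketch below; the base-model ID and the LoRA wiring are assumptions.

```python
# Hypothetical sketch of how some listed hyperparameters could be wired up.
import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",               # assumed base model
    torch_dtype=torch.bfloat16,                # torch_dtype: "bfloat16"
    attn_implementation="flash_attention_2",   # attn_implementation
)
model.gradient_checkpointing_enable()          # gradient_checkpointing: true

lora_config = LoraConfig(r=16)                 # lora_r: 16
model = get_peft_model(model, lora_config)
```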
 
 #### Accelerator Settings
 - deepspeed_config: