CausalLM/miniG · About the Data Generation Method

Aug 27

Hi, It's really an amazing model!
Could you please share more details about the data synthesis method? I believe this will be a great contribution to the community.
Thank you so much.

JosephusCheung

CausalLM org Aug 27

We will write about that soon, and see if we can share some.

And here is a brief: Step 1: Filter 20B raw-text data with cross-page relevance on different topics. Step 2: Clustering and make knowledge graph for long-ctx RAG input. Step 3: Generate questions based on the context. Step 4: For each question, use an LLM to judge if answerable. Step 5: Generate answers. Step 6: Judge whether the answers correct and relevant. Step 7: Synthesize Q&A pairs on a single topic and rewrite into multi-turn conversations of ~8k tokens.

qiying

Aug 28

Got it. Thank you for your response. Looking forward to your writings!