I am delighted to announce the publication of my LegalKit, a French labeled dataset built for legal ML training 🤗
This dataset comprises more than 50k query-document pairs curated for training sentence embedding models within the domain of French law.
The labeling process follows a systematic approach to ensure consistency and relevance:
- Initial Query Generation: Three instances of the LLaMA-3-70B model independently generate three different queries based on the same document.
- Selection of Optimal Query: A fourth instance of the LLaMA-3-70B model, using a dedicated selection prompt, evaluates the generated queries and selects the most suitable one.
- Final Label Assignment: The chosen query is used to label the document, aiming to ensure that the label accurately reflects the content and context of the original text.
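The three steps above can be sketched roughly as follows. This is only an illustration, not the actual code used to build LegalKit: the function name `label_document`, the prompt wording, and the fallback to the first candidate are all assumptions, and the LLaMA-3-70B calls are replaced by simple stub functions so the example runs on its own.

```python
from typing import Callable, List, Tuple

# Hypothetical interface: any function mapping a prompt string to a completion string.
LLM = Callable[[str], str]


def label_document(document: str, generators: List[LLM], selector: LLM) -> Tuple[str, str]:
    """Return a (query, document) pair following the generate-then-select scheme."""
    # Step 1: each generator independently proposes one query for the same document.
    gen_prompt = f"Write a search query that this text answers:\n{document}"
    candidates = [generate(gen_prompt) for generate in generators]

    # Step 2: a separate selector sees the numbered candidates and picks one.
    numbered = "\n".join(f"{i}: {query}" for i, query in enumerate(candidates))
    reply = selector(
        f"Document:\n{document}\n\nQueries:\n{numbered}\n\n"
        "Reply with the number of the most suitable query."
    )
    try:
        choice = int(reply.strip())
    except ValueError:
        choice = 0  # assumption: fall back to the first candidate on an unparsable reply
    if not 0 <= choice < len(candidates):
        choice = 0

    # Step 3: the selected query becomes the label of the document.
    return candidates[choice], document


def make_stub(answer: str) -> LLM:
    """Stand-in for a model call; always returns the same text."""
    return lambda prompt: answer


# Placeholder text, not a real legal provision.
doc = "Texte d'un article de loi (placeholder)."
generators = [make_stub("query A"), make_stub("query B"), make_stub("query C")]
selector = make_stub("1")  # the stub selector always picks candidate number 1

pair = label_document(doc, generators, selector)
```

In a real pipeline, each stub would be swapped for a call to a hosted or local model, and the resulting pairs would be collected into the training set.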