guipenedo (HF staff) committed
Commit 0bf0346
1 Parent(s): f51d538

added citation information

Files changed (2):
  1. dist/index.html +31 -13
  2. src/index.html +31 -0
dist/index.html CHANGED
@@ -704,23 +704,41 @@
     <p>Through our open science efforts we hope to keep shining a light on the black box that is the training of high performance large language models as well as to give every model trainer the ability to create state-of-the-art LLMs. We are excited to continue iterating on FineWeb and to release increasingly better filtered subsets of web data, in a fully open and reproducible manner.</p>
     <p>In the short term, we are looking forward to applying the learnings from (English) FineWeb to other languages. While English currently dominates the LLM landscape, we believe that making high quality web data in other languages as accessible as possible would be incredibly impactful.</p>
     <p>In a nutshell: the future is bright and exciting for studying the science of creating datasets at scale and in the open 🤗.</p>
-
-    <h2>Citation</h2>
-    <d-code block language="latex">
-      @misc{penedo2024finewebdatasetsdecantingweb,
-        title={The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale},
-        author={Guilherme Penedo and Hynek Kydlíček and Loubna Ben allal and Anton Lozhkov and Margaret Mitchell and Colin Raffel and Leandro Von Werra and Thomas Wolf},
-        year={2024},
-        eprint={2406.17557},
-        archivePrefix={arXiv},
-        primaryClass={cs.CL},
-        url={https://arxiv.org/abs/2406.17557},
-      }
-    </d-code>
 </d-article>
 
 <d-appendix>
     <d-bibliography src="bibliography.bib"></d-bibliography>
+    <style>
+      d-appendix .citation {
+        font-size: 11px;
+        line-height: 15px;
+        border-left: 1px solid rgba(0, 0, 0, 0.1);
+        padding-left: 18px;
+        border: 1px solid rgba(0,0,0,0.1);
+        background: rgba(0, 0, 0, 0.02);
+        padding: 10px 18px;
+        border-radius: 3px;
+        color: rgba(150, 150, 150, 1);
+        overflow: hidden;
+        margin-top: -12px;
+        white-space: pre-wrap;
+        word-wrap: break-word;
+      }
+    </style>
+
+    <h3 id="citation">Citation</h3>
+    <p>For attribution in academic contexts, please cite this work as</p>
+    <pre class="citation short">Penedo, et al., "The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale", 2024.</pre>
+    <p>BibTeX citation</p>
+    <pre class="citation long">@misc{penedo2024finewebdatasetsdecantingweb,
+      title={The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale},
+      author={Guilherme Penedo and Hynek Kydlíček and Loubna Ben allal and Anton Lozhkov and Margaret Mitchell and Colin Raffel and Leandro Von Werra and Thomas Wolf},
+      year={2024},
+      eprint={2406.17557},
+      archivePrefix={arXiv},
+      primaryClass={cs.CL},
+      url={https://arxiv.org/abs/2406.17557},
+}</pre>
 </d-appendix>
 
 <script>
src/index.html CHANGED
@@ -708,6 +708,37 @@
 
 <d-appendix>
     <d-bibliography src="bibliography.bib"></d-bibliography>
+    <style>
+      d-appendix .citation {
+        font-size: 11px;
+        line-height: 15px;
+        border-left: 1px solid rgba(0, 0, 0, 0.1);
+        padding-left: 18px;
+        border: 1px solid rgba(0,0,0,0.1);
+        background: rgba(0, 0, 0, 0.02);
+        padding: 10px 18px;
+        border-radius: 3px;
+        color: rgba(150, 150, 150, 1);
+        overflow: hidden;
+        margin-top: -12px;
+        white-space: pre-wrap;
+        word-wrap: break-word;
+      }
+    </style>
+
+    <h3 id="citation">Citation</h3>
+    <p>For attribution in academic contexts, please cite this work as</p>
+    <pre class="citation short">Penedo, et al., "The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale", 2024.</pre>
+    <p>BibTeX citation</p>
+    <pre class="citation long">@misc{penedo2024finewebdatasetsdecantingweb,
+      title={The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale},
+      author={Guilherme Penedo and Hynek Kydlíček and Loubna Ben allal and Anton Lozhkov and Margaret Mitchell and Colin Raffel and Leandro Von Werra and Thomas Wolf},
+      year={2024},
+      eprint={2406.17557},
+      archivePrefix={arXiv},
+      primaryClass={cs.CL},
+      url={https://arxiv.org/abs/2406.17557},
+}</pre>
 </d-appendix>
 
 <script>