Commit 76c9852 (parent e58e006) by omkarenator
citations and other fixes

Files changed:
- common.py +4 -3
- curated.py +5 -5
- main.py +10 -4
common.py
CHANGED
@@ -308,7 +308,7 @@ global_div = Div(
         "Deduplication is beneficial for LM pretraining in several ways, with the most important being controllable upsampling. With unique data, teams gain fine-grained control over the training data. Other benefits of deduplication include avoiding train-test overlap which prevents evaluation contamination."
     ),
     P(
-        "Duplicate data can lead to a strong double descent phenomenon, where repeated data causes test loss to increase midway through training
+        "Duplicate data can lead to a strong double descent phenomenon, where repeated data causes test loss to increase midway through training", D_cite(bibtex_key="hernandez2022scaling"), ". Additionally, it reduces the risk of memorization", D_cite(bibtex_key="lee2022deduplicating"), ". By implementing deduplication and selective upsampling, we gain control over the pretraining data distribution, rather than relying on the inherent distribution of the source."
     ),
     P(
         "To illustrate the need for deduplication, below is the distribution of near-duplicate clusters, organized into buckets of 100. The first bucket contains clusters with sizes ranging from 2 to 100, as found in the Common Crawl dataset. Some clusters even reach up to a million documents."
@@ -406,6 +406,7 @@ global_div = Div(
         "In sizable clusters comprising 1000 or more documents, we observe a trend towards templatization. This involves the recurrent use of standardized language to convey general topics such as terms and conditions, warnings, and disclaimers. Such language is prevalent on commercial websites, offering a consistent and efficient way to communicate commonly encountered information."
     ),
     Img(src="images/image9.png", style="max-width: 100%;"),
+    id="section47",
 ),
 Section(
     H2("Personally Identifiable Information Removal"),
@@ -435,7 +436,7 @@ global_div = Div(
             style="list-style-type: none",
         ),
     ),
-    id="
+    id="section48",
 ),
 Section(
     H2("Normalization Form C"),
@@ -455,7 +456,7 @@ global_div = Div(
             style="list-style-type: none",
         )
     ),  # "background-color= gray" "color= blue" maybe add this later
-    id="
+    id="section49",
 ),
 Section(
     H3("NFC Examples"),
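The blog text in the first hunk describes bucketing near-duplicate clusters by size, with the first bucket covering cluster sizes 2 to 100. A minimal sketch of that bucketing in Python (illustrative only, not TxT360 pipeline code; the input sizes are hypothetical):

    # Bucket near-duplicate cluster sizes into buckets of width 100:
    # bucket 0 holds clusters of size 2-100, bucket 1 holds 101-200, and so on.
    from collections import Counter

    def bucket_cluster_sizes(cluster_sizes, bucket_width=100):
        buckets = Counter()
        for size in cluster_sizes:
            if size < 2:
                continue  # a single document is not a duplicate cluster
            buckets[(size - 1) // bucket_width] += 1  # sizes 2..100 -> bucket 0
        return buckets

    # Hypothetical sizes: most clusters are small, a few reach ~1M documents.
    print(bucket_cluster_sizes([2, 57, 100, 101, 1_000_000]))
    # Counter({0: 3, 1: 1, 9999: 1})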
curated.py
CHANGED
@@ -296,7 +296,7 @@ table_div_hn = Div(NotStr(table_html_hn))
 uirc_filter = pd.DataFrame(
     {
         "Dataset": [
-            "
+            "Ubuntu IRC",
         ],
         "Lines Downloaded": [
            "37966",
@@ -854,7 +854,7 @@ filtering_process = Div(
             style="margin-bottom: -3px",
         ),
         Li(
-            "Paragraph Count Filter: The paper must have at least 5 paragraphs after removing paragraphs with less than -20 average log
+            "Paragraph Count Filter: The paper must have at least 5 paragraphs after removing paragraphs with less than -20 average log word probability",
             style="margin-bottom: -3px",
         ),
         Li(
@@ -1140,7 +1140,7 @@ filtering_process = Div(
             Raw single line in data: <P> Hi I am speaker
             After tag removal: P Hi I am speaker
             We remove everything that starts with ["P", "BRK", "CHAPTER", "/P"]
-            and only keep
+            and only keep tagname == SPEAKER
             because line starting with <SPEAKER> TEXT TEXT ....... has the relevant text
             """,
             style="block",
@@ -1217,7 +1217,7 @@ filtering_process = Div(
             style="margin-bottom: -3px",
         ),
         Li(
-            "As discussed above, the comment
+            "As discussed above, the comment hierarchies required a thoughtful approach to extracting meaningful data.",
             style="margin-bottom: -3px",
         ),
         Li(
@@ -1374,7 +1374,7 @@ filtering_process = Div(
         P(B("Unique Data Preparation Challenges: ")),
         Ul(
             Li(
-                "Handling code block was a required finding the specific blocks and
+                "Handling code blocks required finding the specific blocks and extracting the details in one snippet.",
                 style="margin-bottom: -3px",
             ),
             Li(
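The Paragraph Count Filter completed in the second hunk is concrete enough to sketch: a paper is kept only if at least 5 paragraphs survive after dropping those whose average log word probability falls below -20. A minimal sketch, where word_logprob is a hypothetical word-probability table standing in for whatever scorer TxT360 actually uses:

    def avg_log_word_prob(paragraph, word_logprob):
        # Mean log-probability over the paragraph's words; unknown words get
        # a low default so gibberish paragraphs score poorly (an assumption).
        words = paragraph.split()
        if not words:
            return float("-inf")
        return sum(word_logprob.get(w.lower(), -25.0) for w in words) / len(words)

    def passes_paragraph_count_filter(paragraphs, word_logprob,
                                      min_paragraphs=5, threshold=-20.0):
        kept = [p for p in paragraphs
                if avg_log_word_prob(p, word_logprob) >= threshold]
        return len(kept) >= min_paragraphs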
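The transcript-cleaning rule spelled out in the third hunk (drop lines tagged P, BRK, CHAPTER, or /P; keep only lines whose tag is SPEAKER) reduces to a small line filter. A sketch under the assumption that every relevant line begins with a <TAG>:

    def keep_line(raw_line: str) -> bool:
        stripped = raw_line.lstrip()
        if not stripped.startswith("<"):
            return False  # assumption: relevant lines always carry a tag
        tag = stripped[1:].split(">", 1)[0].strip()
        # Only "<SPEAKER> TEXT TEXT ..." lines carry the relevant text.
        return tag == "SPEAKER"

    lines = ["<P> Hi I am speaker", "<SPEAKER> Hello everyone"]
    print([l for l in lines if keep_line(l)])  # ['<SPEAKER> Hello everyone']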
main.py
CHANGED
@@ -328,16 +328,22 @@ def main():
             ),
             Li(
                 A(
-                    "
+                    "Analysis of Near-Duplicate Clusters",
                     href="#section47",
                 )
             ),
             Li(
                 A(
-                    "
+                    "Personally Identifiable Information Removal",
                     href="#section48",
                 )
             ),
+            Li(
+                A(
+                    "Normalization Form C",
+                    href="#section49",
+                )
+            ),
         ),
     ),
     Div(
@@ -872,7 +878,7 @@ def intro():
     D_cite(bibtex_key="dclm"),
     "and RedPajama V2,",
     D_cite(bibtex_key="redpajama-v2"),
-    "we also hope to provide a dataset at this scale that is ready to go, without requiring
+    "we also hope to provide a dataset at this scale that is ready to go, without requiring further filtering."
 ),
 P(
     B("How to Read this Blog Post?"),
@@ -884,7 +890,7 @@ def intro():
 Section(
     H2("Why TxT360"),
     P(
-        "In this year we have seen excellent datasets released by the community. Among those, most datasets focus on one source (e.g., crawled websites, code bases, papers). However, it is not trivial to combine these sources together due to the potential
+        "This year we have seen excellent datasets released by the community. Among those, most datasets focus on one source (e.g., crawled websites, code bases, papers). However, it is not trivial to combine these sources together due to the potential duplication across them. TxT360 is the first dataset to combine most of the sources commonly used in pretraining."
     ),
     new_table_div_1,
     # table_div_1,
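The new section49 entry links to the Normalization Form C section added in common.py. For readers unfamiliar with NFC, a one-line illustration using Python's standard library (not TxT360 code): NFC composes a base character plus combining mark into a single precomposed code point.

    import unicodedata

    decomposed = "e\u0301"  # 'e' + COMBINING ACUTE ACCENT (two code points)
    composed = unicodedata.normalize("NFC", decomposed)
    print(len(decomposed), len(composed))  # 2 1
    print(composed == "\u00e9")           # True: one code point for 'é'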