BlankCheng committed 10 days ago

Commit 34cb4c4 • 1 Parent(s): 0017b35

Ensure format

Files changed (1)

curated.py +12 -9

@@ -720,16 +720,18 @@ filtering_process = Div(
                 ". Finally, all markdowns were combined to create jsonl files.",
             ),
             P(B("Unique Data Preparation Challenges: ")),
-            P("When converting LaTeX files into Markdown using Pandoc, it is crucial to account for different data formats to minimize information loss while also filtering out noisy content in LaTeX. Below, we outline our considerations and methods for handling various data types during this conversion process:"),
             Ul(
                 Li(
                     B("Tables: "),
-                    "The process for handling tables follows three main approaches. First, tables compatible with Pandoc’s built-in formats are directly converted into reliable Markdown tables. While table wrappers are removed by default, they can be reintroduced using any desired symbols. Notably, LaTeX’s `\\multicolumn` and `\\multirow` commands are successfully translated into valid Markdown tables. Second, tables unsupported by Pandoc’s native functionality, such as `deluxetable` or other complex LaTeX types, are preserved in their original LaTeX format to maintain the integrity of complex structures. Third, certain tables are converted to HTML web tables. Although the exact conditions for this conversion are unclear, the resulting HTML format is correctly structured.",
                     style="margin-bottom: -3px",
                 ),
                 Li(
                     B("Mathematical Expressions: "),
-                    "Inline mathematical expressions are rendered in Markdown using `$...$` or `$$...$$` wrappers. More complex equations remain unchanged and are presented as `\\begin{aligned}` blocks to ensure accuracy and readability.",
                     style="margin-bottom: -3px",
                 ),
                 Li(
@@ -739,17 +741,15 @@ filtering_process = Div(
                 ),
                 Li(
                     B("Section Headers: "),
-                    "Section headers are converted into markdown format, using leading `#` symbols to represent the heading levels.",
                     style="margin-bottom: -3px",
                 ),
                 Li(
                     B("References: "),
                     "References are removed. Although they may be informative, references often introduce formatting inconsistencies or add little value compared to the core content of the paper.",
                     style="margin-bottom: -3px",
-                )
-            )
             P(
                 B(" Filters Applied: "),
                 "multiple filters are used here after manually verifying output of all the filters as suggested by peS2o dataset",
@@ -906,7 +906,10 @@ filtering_process = Div(
                     href="ttps://ftp.ncbi.nlm.nih.gov/pub/pmc/oa_package/",
                 ),
                 ". PubMed Central (PMC) files are downloaded in an xml.tar format. The tar files are opened and converted to markdown format using pandoc",
-                D_code("pandoc <raw_xml_path> -s -o <output_markdown_path> -f jats -t markdown_mmd [--lua-filter <lua_filter_path>]", language="bash"),
                 ". The markdown files are combined to create jsonl files. PubMed Abstract (PMA) files were downloaded in xml. The BeautifulSoup library was used to extract the abstract, title, and PMID. All files were stored in jsonl format.",
             ),
             P(B("Unique Data Preparation Challenges: ")),

@@ -720,16 +720,18 @@ filtering_process = Div(
                 ". Finally, all markdowns were combined to create jsonl files.",
             ),
             P(B("Unique Data Preparation Challenges: ")),
+            P(
+                "When converting LaTeX files into Markdown using Pandoc, it is crucial to account for different data formats to minimize information loss while also filtering out noisy content in LaTeX. Below, we outline our considerations and methods for handling various data types during this conversion process:"
+            ),
             Ul(
                 Li(
                     B("Tables: "),
+                    "The process for handling tables follows three main approaches. First, tables compatible with Pandoc’s built-in formats are directly converted into standard Markdown tables. Notably, LaTeX’s '\\multicolumn' and '\\multirow' commands can be successfully translated into valid Markdown tables. Second, tables unsupported by Pandoc’s native functionality, such as deluxetable or other complex LaTeX types, are preserved in their original LaTeX format to maintain the integrity of complex structures. Third, only a few remaining tables have been converted to HTML web tables.",
                     style="margin-bottom: -3px",
                 ),
                 Li(
                     B("Mathematical Expressions: "),
+                    "Inline mathematical expressions are rendered in Markdown. More complex equations remain unchanged, e.g., presented as '\\begin{aligned}' blocks, to ensure accuracy and readability.",
                     style="margin-bottom: -3px",
                 ),
                 Li(
@@ -739,17 +741,15 @@ filtering_process = Div(
                 ),
                 Li(
                     B("Section Headers: "),
+                    "Section headers are converted into markdown format, using leading '#' symbols to represent the heading levels.",
                     style="margin-bottom: -3px",
                 ),
                 Li(
                     B("References: "),
                     "References are removed. Although they may be informative, references often introduce formatting inconsistencies or add little value compared to the core content of the paper.",
                     style="margin-bottom: -3px",
+                ),
+            ),
             P(
                 B(" Filters Applied: "),
                 "multiple filters are used here after manually verifying output of all the filters as suggested by peS2o dataset",
@@ -906,7 +906,10 @@ filtering_process = Div(
                     href="ttps://ftp.ncbi.nlm.nih.gov/pub/pmc/oa_package/",
                 ),
                 ". PubMed Central (PMC) files are downloaded in an xml.tar format. The tar files are opened and converted to markdown format using pandoc",
+                D_code(
+                    "pandoc <raw_xml_path> -s -o <output_markdown_path> -f jats -t markdown_mmd [--lua-filter <lua_filter_path>]",
+                    language="bash",
+                ),
                 ". The markdown files are combined to create jsonl files. PubMed Abstract (PMA) files were downloaded in xml. The BeautifulSoup library was used to extract the abstract, title, and PMID. All files were stored in jsonl format.",
             ),
             P(B("Unique Data Preparation Challenges: ")),
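
The pandoc command quoted in the last hunk can be driven from Python. The sketch below is only an illustration of how that command line might be assembled, not the pipeline the authors ran: the function names `build_pandoc_command` and `convert_jats_to_markdown` are hypothetical, and it assumes `pandoc` is on `PATH` and that the Lua filter is optional, as the square brackets in the quoted command suggest.

```python
import subprocess
from typing import List, Optional


def build_pandoc_command(
    raw_xml_path: str,
    output_markdown_path: str,
    lua_filter_path: Optional[str] = None,
) -> List[str]:
    """Return the argv list for the pandoc call quoted in the diff.

    Hypothetical helper: the flags mirror the quoted command, the
    function itself is not part of curated.py.
    """
    cmd = [
        "pandoc",
        raw_xml_path,
        "-s",                    # standalone output
        "-o", output_markdown_path,
        "-f", "jats",            # input format: JATS XML
        "-t", "markdown_mmd",    # output format: MultiMarkdown
    ]
    # The Lua filter is bracketed as optional in the quoted command.
    if lua_filter_path is not None:
        cmd += ["--lua-filter", lua_filter_path]
    return cmd


def convert_jats_to_markdown(
    raw_xml_path: str,
    output_markdown_path: str,
    lua_filter_path: Optional[str] = None,
) -> None:
    """Run pandoc and raise CalledProcessError if it exits non-zero."""
    cmd = build_pandoc_command(raw_xml_path, output_markdown_path, lua_filter_path)
    subprocess.run(cmd, check=True)
```

Passing the arguments as a list rather than a single shell string avoids quoting problems with paths that contain spaces. For example, `convert_jats_to_markdown("article.xml", "article.md")` would run the command without a Lua filter.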