BlankCheng committed on
Commit
34cb4c4
1 Parent(s): 0017b35

Ensure format

Files changed (1)
  1. curated.py +12 -9
curated.py CHANGED
@@ -720,16 +720,18 @@ filtering_process = Div(
 ". Finally, all markdowns were combined to create jsonl files.",
 ),
 P(B("Unique Data Preparation Challenges: ")),
-P("When converting LaTeX files into Markdown using Pandoc, it is crucial to account for different data formats to minimize information loss while also filtering out noisy content in LaTeX. Below, we outline our considerations and methods for handling various data types during this conversion process:"),
+P(
+"When converting LaTeX files into Markdown using Pandoc, it is crucial to account for different data formats to minimize information loss while also filtering out noisy content in LaTeX. Below, we outline our considerations and methods for handling various data types during this conversion process:"
+),
 Ul(
 Li(
 B("Tables: "),
-"The process for handling tables follows three main approaches. First, tables compatible with Pandoc’s built-in formats are directly converted into reliable Markdown tables. While table wrappers are removed by default, they can be reintroduced using any desired symbols. Notably, LaTeX’s `\\multicolumn` and `\\multirow` commands are successfully translated into valid Markdown tables. Second, tables unsupported by Pandoc’s native functionality, such as `deluxetable` or other complex LaTeX types, are preserved in their original LaTeX format to maintain the integrity of complex structures. Third, certain tables are converted to HTML web tables. Although the exact conditions for this conversion are unclear, the resulting HTML format is correctly structured.",
+"The process for handling tables follows three main approaches. First, tables compatible with Pandoc’s built-in formats are directly converted into standard Markdown tables. Notably, LaTeX’s '\\multicolumn' and '\\multirow' commands can be successfully translated into valid Markdown tables. Second, tables unsupported by Pandoc’s native functionality, such as deluxetable or other complex LaTeX types, are preserved in their original LaTeX format to maintain the integrity of complex structures. Third, only a few remaining tables have been converted to HTML web tables.",
 style="margin-bottom: -3px",
 ),
 Li(
 B("Mathematical Expressions: "),
-"Inline mathematical expressions are rendered in Markdown using `$...$` or `$$...$$` wrappers. More complex equations remain unchanged and are presented as `\\begin{aligned}` blocks to ensure accuracy and readability.",
+"Inline mathematical expressions are rendered in Markdown. More complex equations remain unchanged, e.g., presented as '\\begin{aligned}' blocks, to ensure accuracy and readability.",
 style="margin-bottom: -3px",
 ),
 Li(
@@ -739,17 +741,15 @@ filtering_process = Div(
 ),
 Li(
 B("Section Headers: "),
-"Section headers are converted into markdown format, using leading `#` symbols to represent the heading levels.",
+"Section headers are converted into markdown format, using leading '#' symbols to represent the heading levels.",
 style="margin-bottom: -3px",
 ),
 Li(
 B("References: "),
 "References are removed. Although they may be informative, references often introduce formatting inconsistencies or add little value compared to the core content of the paper.",
 style="margin-bottom: -3px",
-)
-)
-
-
+),
+),
 P(
 B(" Filters Applied: "),
 "multiple filters are used here after manually verifying output of all the filters as suggested by peS2o dataset",
@@ -906,7 +906,10 @@ filtering_process = Div(
 href="ttps://ftp.ncbi.nlm.nih.gov/pub/pmc/oa_package/",
 ),
 ". PubMed Central (PMC) files are downloaded in an xml.tar format. The tar files are opened and converted to markdown format using pandoc",
-D_code("pandoc <raw_xml_path> -s -o <output_markdown_path> -f jats -t markdown_mmd [--lua-filter <lua_filter_path>]", language="bash"),
+D_code(
+"pandoc <raw_xml_path> -s -o <output_markdown_path> -f jats -t markdown_mmd [--lua-filter <lua_filter_path>]",
+language="bash",
+),
 ". The markdown files are combined to create jsonl files. PubMed Abstract (PMA) files were downloaded in xml. The BeautifulSoup library was used to extract the abstract, title, and PMID. All files were stored in jsonl format.",
 ),
 P(B("Unique Data Preparation Challenges: ")),
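
For readers who want to try the PMC conversion step described in the last hunk, the following is a minimal sketch, not the project's actual pipeline. It assembles the same pandoc invocation quoted in the D_code string and then combines the resulting markdown files into a jsonl file. The function names, the "*.md" glob, and the "id"/"text" record fields are assumptions made for illustration; the commit does not specify them.

```python
import json
import subprocess
from pathlib import Path


def build_pandoc_command(raw_xml_path, output_markdown_path, lua_filter_path=None):
    """Build the argument list for the pandoc call quoted in the diff."""
    cmd = [
        "pandoc", str(raw_xml_path),
        "-s",
        "-o", str(output_markdown_path),
        "-f", "jats",
        "-t", "markdown_mmd",
    ]
    # The Lua filter is optional in the quoted command, so it is optional here.
    if lua_filter_path is not None:
        cmd += ["--lua-filter", str(lua_filter_path)]
    return cmd


def convert_jats_to_markdown(raw_xml_path, output_markdown_path, lua_filter_path=None):
    """Run pandoc; requires pandoc on PATH and raises if it exits non-zero."""
    cmd = build_pandoc_command(raw_xml_path, output_markdown_path, lua_filter_path)
    subprocess.run(cmd, check=True)


def combine_markdown_to_jsonl(markdown_dir, jsonl_path):
    """Write one JSON object per markdown file; returns the number of records.

    The "id" and "text" field names are hypothetical, chosen for this sketch.
    """
    count = 0
    with open(jsonl_path, "w", encoding="utf-8") as out:
        for md_file in sorted(Path(markdown_dir).glob("*.md")):
            record = {"id": md_file.stem, "text": md_file.read_text(encoding="utf-8")}
            out.write(json.dumps(record, ensure_ascii=False) + "\n")
            count += 1
    return count
```

Passing the arguments as a list rather than a shell string avoids quoting problems when paths contain spaces. Only convert_jats_to_markdown needs pandoc installed; the other two functions can be exercised without it.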