omkarenator committed on
Commit
888beee
1 Parent(s): b6c56e9

add more stuff

Browse files
Files changed (1) hide show
  1. main.py +79 -10
main.py CHANGED
@@ -346,23 +346,92 @@ def curated(request):
346
  )
347
 
348
  table_html = data_preparation_steps.to_html(index=False, border=0)
349
- table_div = Div(NotStr(table_html), cls="l-body-outset")
350
 
351
- expander = Details(
352
- Summary("Raw Data Extraction"),
353
- get_data(),
354
- style="border: 1px solid #ccc; padding: 20px;",
355
- open=True,
 
 
 
 
 
 
 
 
 
 
 
 
356
  )
357
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
358
  return Div(
359
  Section(
360
  H2("Curated Sources"),
361
  plotly2fasthtml(get_chart_28168342()),
362
- H3("Data Preparation"),
363
- table_div,
364
- H3("Data Preprocessing"),
365
- expander,
366
  id="inner-text",
367
  )
368
  )
 
346
  )
347
 
348
  table_html = data_preparation_steps.to_html(index=False, border=0)
349
+ table_div = Div(NotStr(table_html), style="margin: 40px;")
350
 
351
+ text = P("""This initial stage serves as the foundation for the entire
352
+ process. Here, we focus on acquiring and extracting the raw data, which can
353
+ come from various sources such as crawling websites, using HTTP/FTP dumps,
354
+ or working with archive dumps. For instance, to download and prepare a
355
+ dataset, we can specific downloaders based on the data source. Each dataset
356
+ might have its own downloader script which can be updated in real time to
357
+ handle changes in the data source. Here is a general outline of the data
358
+ preparation process: It's worth noting that some pipelines might require
359
+ invoking additional functions or scripts to handle specific data sources or
360
+ formats. These helper scripts can be located within specific directories
361
+ or modules dedicated to the dataset.""")
362
+
363
+ data_preparation_div = Div(
364
+ H3("Data Preparation"),
365
+ text,
366
+ table_div,
367
+ Div(get_data(), style="border: 1px solid #ccc; padding: 20px;"),
368
  )
369
 
370
+ text = P("""Data preprocessing is a crucial step in the data science
371
+ pipeline. It involves cleaning and transforming raw data into a format that
372
+ is suitable for analysis. This process includes handling missing values,
373
+ normalizing data, encoding categorical variables, and more.""")
374
+
375
+ preprocessing_steps = pd.DataFrame(
376
+ {
377
+ "Step": [
378
+ "Language Filter",
379
+ "Min Word Count",
380
+ "Title Abstract",
381
+ "Majority Language",
382
+ "Paragraph Count",
383
+ "Frequency",
384
+ "Unigram Log Probability",
385
+ ],
386
+ "Description": [
387
+ "Filtering data based on language",
388
+ "Setting a minimum word count threshold",
389
+ "Extracting information from the title and abstract",
390
+ "Identifying the majority language in the dataset",
391
+ "Counting the number of paragraphs in each document",
392
+ "Calculating the frequency of each word in the dataset",
393
+ "Calculating the log probability of each unigram",
394
+ ],
395
+ "Need": [
396
+ "To remove documents in unwanted languages",
397
+ "To filter out documents with very few words",
398
+ "To extract relevant information for analysis",
399
+ "To understand the distribution of languages in the dataset",
400
+ "To analyze the structure and length of documents",
401
+ "To identify important words in the dataset",
402
+ "To measure the significance of individual words",
403
+ ],
404
+ "Pros": [
405
+ "Improves data quality by removing irrelevant documents",
406
+ "Filters out low-quality or incomplete documents",
407
+ "Provides additional information for analysis",
408
+ "Enables language-specific analysis and insights",
409
+ "Helps understand the complexity and content of documents",
410
+ "Identifies important terms and topics in the dataset",
411
+ "Quantifies the importance of individual words",
412
+ ],
413
+ "Cons": [
414
+ "May exclude documents in less common languages",
415
+ "May remove documents with valuable information",
416
+ "May introduce bias in the analysis",
417
+ "May not accurately represent the language distribution",
418
+ "May not capture the complexity of document structure",
419
+ "May be sensitive to noise and outliers",
420
+ "May not capture the semantic meaning of words",
421
+ ],
422
+ }
423
+ )
424
+
425
+ table_html = preprocessing_steps.to_html(index=False, border=0)
426
+ table_div = Div(NotStr(table_html), style="margin: 40px;")
427
+ data_preprocessing_div = Div(H3("Data Preprocessing"), text, table_div)
428
+
429
  return Div(
430
  Section(
431
  H2("Curated Sources"),
432
  plotly2fasthtml(get_chart_28168342()),
433
+ data_preparation_div,
434
+ data_preprocessing_div,
 
 
435
  id="inner-text",
436
  )
437
  )