omkarenator committed on
Commit
dcb73ca
•
1 Parent(s): 0cc9c9f

add better findcc figure

.gitattributes CHANGED
@@ -48,3 +48,4 @@ data/meta_non_web.py filter=lfs diff=lfs merge=lfs -text
  data/sample_bad_urls.py filter=lfs diff=lfs merge=lfs -text
  data/sample_refinedweb_line.json filter=lfs diff=lfs merge=lfs -text
  images/llm360_logo.png filter=lfs diff=lfs merge=lfs -text
+ images/findcc.svg filter=lfs diff=lfs merge=lfs -text
common.py CHANGED
@@ -1,6 +1,6 @@
  from fasthtml.common import *
  from fasthtml.components import *
- from fasthtml.components import D_title, D_article, D_front_matter, D_contents, D_byline
+ from fasthtml.components import D_title, D_article, D_front_matter, D_contents, D_byline, D_cite
  from fh_plotly import plotly2fasthtml
  import pandas as pd
  import json
@@ -352,7 +352,7 @@ global_div = Div(
  ),
  Section(
  H3("Stage 4: Finding Connected Components using MapReduce"),
- Img(src="images/cc.png", style="max-width: 100%;"),
+ Img(src="images/findcc.svg", style="max-width: 100%;"),
  P(
  "The purpose of this step is to create a set of clusters of matching pairs. For example, a list of pairs (A, B), (B, C), (D, E) is merged into a list of components (A, B, C) and (D, E). Using a third-party library like NetworkX to find connected components would require all pairs to fit into the memory of a single machine, which is not feasible. Instead, we implemented a distributed connected component finder [4] using the Dask framework, which can scale across multiple machines. The algorithm works by mapping edges by both the source and destination of pairs and reducing only edges where the source is greater than the destination. It performs successive iterations of this MapReduce computation until convergence, meaning the number of new edges produced becomes zero. In the end, every document in a cluster points to the smallest document within the cluster. Later, we compile a list of duplicate documents that need deletion and gather statistics about each component."
  ),
images/{cc.png → findcc.svg} RENAMED
File without changes
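
Note on the Stage 4 paragraph in common.py: the idea of iterating a map step and a reduce step until nothing changes can be illustrated with a small single-machine sketch. This is not the Dask implementation the text refers to, and it does not use the edge-rewriting scheme from [4]. It swaps in plain min-label propagation, which reaches the same end state (every node points to the smallest node in its component). The function names `find_connected_components` and `group_by_root` are hypothetical and chosen only for this example.

```python
from collections import defaultdict


def find_connected_components(pairs):
    """Return {node: smallest node in its component} for undirected pairs.

    Illustrative min-label propagation, assuming node ids are comparable.
    """
    pairs = list(pairs)  # the edges are scanned once per iteration

    # Every node starts out pointing to itself.
    labels = {}
    for a, b in pairs:
        labels.setdefault(a, a)
        labels.setdefault(b, b)

    while True:
        # Map: each edge sends each endpoint's current label to the other endpoint.
        candidates = defaultdict(list)
        for a, b in pairs:
            candidates[a].append(labels[b])
            candidates[b].append(labels[a])

        # Reduce: each node keeps the smallest label it received, if smaller.
        updates = 0
        for node, received in candidates.items():
            best = min(received)
            if best < labels[node]:
                labels[node] = best
                updates += 1

        # Convergence: an iteration that changes nothing ends the loop.
        if updates == 0:
            return labels


def group_by_root(labels):
    """Group nodes by the representative they point to."""
    clusters = defaultdict(list)
    for node, root in labels.items():
        clusters[root].append(node)
    return {root: sorted(members) for root, members in clusters.items()}
```

Called on the pairs from the paragraph, `find_connected_components([("A", "B"), ("B", "C"), ("D", "E")])` maps A, B, and C to A and maps D and E to D. In a deduplication setting, every node that is not its own representative would be a candidate for deletion.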