Spaces: LLM360/TxT360

victormiller committed 19 days ago

Commit aa13e37 (1 parent: ae1d7f9)

Update web.py


web.py +478 -25


@@ -612,11 +612,59 @@ def web_data():
         but longer duplicate passages. To achieve this goal, we calculate over the document both the fraction of passages
         that are duplicates, and the fraction of characters contained within those duplicated passages.
         """),
-        H6("Implementations from Dolma"),
-        D_code(dolma311, block="block", language="python"),
-        P("..."),  # Add specific implementation details if available
-        H6("Implementations from DataTrove"),
-        P("..."),  # Add specific implementation details if available
         P("""
         After evaluating the implementations of Dolma and DataTrove (note: RedPajama V2 does not implement these two quality
         signals), we have made the following decisions:
@@ -639,6 +687,25 @@ def web_data():
         ensures consistency with the overall document character count calculation.
         """),
         H5("Our Implementation"),
         Details(
             Summary("Sample documents filtered by excessive line repetitions / characters in repeated lines"),
             DV(
@@ -652,12 +719,85 @@ def web_data():
         Following Gopher [2], we remove documents with a high portion of n-grams. For each n ∈ (2, 3, 4), we calculate the
         fraction of characters contained within the most frequently-occurring n-gram.
         """),
-        H6("Implementations from Dolma"),
-        P("..."),  # Add specific implementation details if available
-        H6("Implementations from RedPajama-V2"),
-        P("..."),  # Add specific implementation details if available
-        H6("Implementations from DataTrove"),
-        P("..."),  # Add specific implementation details if available
         P("""
         There are almost no contradictions between the above implementations of the fraction of characters in the most common
         n-gram. The main process involves counting the occurrences of each n-gram and selecting the most common one. The
@@ -668,7 +808,23 @@ def web_data():
         In practice, documents affected by this rule — where the most common n-gram exceeds a given threshold and occurs
         only once — tend to be short.
         """),
-        H5("Our Implementations"),
         Details(
             Summary("Sample documents filtered by the fraction of characters in the most common n-grams (n=2,3,4)"),
             DV(
@@ -683,27 +839,172 @@ def web_data():
         fraction of characters contained within all duplicate n-grams, taking care not to count characters that occur in
         overlapping n-grams more than once.
         """),
-        H6("Implementations from Dolma"),
-        P("..."),  # Add specific implementation details if available
-        H6("Implementations from RedPajama-V2"),
-        P("..."),  # Add specific implementation details if available
-        H6("Implementations from DataTrove"),
-        P("..."),  # Add specific implementation details if available
         P("""
         For the computation of the fraction of characters in duplicate n-grams, Dolma uses the number of characters in all
         n-grams (with overlapping) as the denominator, and uses the number of characters in all duplicated n-grams
-        (with overlapping) as the numerator. RedPajama V2 uses the number of all characters in (the words of) the document
         (without overlapping) as the denominator, and uses the number of characters that are recognized as part of the
-        duplicate n-gram as the numerator. Datatrove uses the number of all characters in the document (including white
         spaces, without overlapping) as the denominator, and uses the number of characters that are recognized as
         duplicate n-gram as the numerator. However, there is a mismatch in DataTrove’s calculation, as the number of
         characters in the duplicated n-grams excludes white spaces, while the total character count of the document
-        does not.
-        We decided to use the RedPajama V2 implementation but skip the 1st occurrence of the duplicate n-gram.
         """),
-        H5("Our Implementations"),
-        H5("An Example to Show the Difference Between Above Implementations"),
-        P("..."),  # Add specific examples if available
         H5(
             "Sample Documents Filtered by the Fraction of Characters in Duplicated N-grams (n=5,...,10)"
         ),
@@ -722,6 +1023,71 @@ def web_data():
         works ([2], [3], [6]), we remove the documents if more than 30% of the lines end with an ellipsis or more than
         90% of lines start with a bullet point.
         """),
         Details(
             Summary("Sample documents that are filtered out by line-wise heuristics"),
             DV(
@@ -730,6 +1096,7 @@ def web_data():
                 "Sample documents that are filtered out by line-wise heuristics",
             ),
         ),
         H4("3.3 Statistics-based Heuristics"),
         P("We summarize other statistics-based rules that originated from Gopher [7] in this section. The statistics that can be used include:"),
         Ul(
@@ -753,17 +1120,51 @@ def web_data():
         Details(
             Summary("Implementations from Dolma"),
             D_code("""
             """, block="block", language="python"),
         ),
         Details(
             Summary("Implementations from RedPajama-V2"),
             D_code("""
             """, block="block", language="python"),
         ),
         Details(
             Summary("Implementations from DataTrove"),
             D_code("""
             """, block="block", language="python"),
         ),
         P("""
@@ -798,6 +1199,16 @@ def web_data():
         Details(
             Summary("Implementations from RedPajama-V2"),
             D_code("""
             """, block="block", language="python"),
         ),
         P("""
@@ -807,6 +1218,13 @@ def web_data():
         Details(
             Summary("TxT360 Implementation"),
             D_code("""
             """, block="block", language="python"),
         ),
@@ -818,22 +1236,57 @@ def web_data():
         Details(
             Summary("Implementations from Dolma"),
             D_code("""
             """, block="block", language="python"),
         ),
         Details(
             Summary("Implementations from RedPajama-V2"),
             D_code("""
             """, block="block", language="python"),
         ),
         Details(
             Summary("Implementations from DataTrove"),
             D_code("""
             """, block="block", language="python"),
         ),
         Details(
             Summary("TxT360 Implementation"),
             D_code("""
             """, block="block", language="python"),
         ),

         but longer duplicate passages. To achieve this goal, we calculate over the document both the fraction of passages
         that are duplicates, and the fraction of characters contained within those duplicated passages.
         """),
+        Details(
+            Summary("Implementations from Dolma"),
+            D_code("""
+            words = text.split()
+            word_count = len(words)
+            character_count = sum(len(word) for word in words)
+            ...
+            lines = text.split("\n")
+            line_count = len(lines)
+            ...
+            line_counts = Counter(lines)
+            attrs.fraction_of_duplicate_lines = sum(count for line, count in line_counts.items() if count > 1) / max(
+                line_count, 1
+            )
+            attrs.fraction_of_characters_in_duplicate_lines = sum(
+                len(line) * count for line, count in line_counts.items() if count > 1
+            ) / max(character_count, 1)
+            """, block="block", language="python"),
+        ),
+        Details(
+            Summary("Implementations from DataTrove"),
+            D_code("""
+            def find_duplicates(x: list[str]) -> tuple[int, int]:
+                unique_x = set()
+                duplicate_chars = 0
+                duplicate_elements = 0
+                for element in x:
+                    if element in unique_x:
+                        duplicate_chars += len(element)
+                        duplicate_elements += 1
+                    else:
+                        unique_x.add(element)
+                return duplicate_elements, duplicate_chars
+            ...
+            self.paragraph_exp = re.compile(r"\n{2,}")
+            self._line_splitter = re.compile("\n+")
+            ...
+            paragraphs = self.paragraph_exp.split(text.strip())
+            paragraphs_duplicates, char_duplicates = find_duplicates(paragraphs)
+            if self.dup_para_frac and paragraphs_duplicates / len(paragraphs) > self.dup_para_frac:
+                return False, "dup_para_frac"
+            if self.dup_para_char_frac and char_duplicates / len(text) > self.dup_para_char_frac:
+                return False, "dup_para_char_frac"
+            lines = self._line_splitter.split(text)
+            line_duplicates, char_duplicates = find_duplicates(lines)
+            if self.dup_line_frac and line_duplicates / len(lines) > self.dup_line_frac:
+                return False, "dup_line_frac"
+            if self.dup_line_char_frac and char_duplicates / len(text) > self.dup_line_char_frac:
+                return False, "dup_line_char_frac"
+            """, block="block", language="python"),
+        ),
         P("""
         After evaluating the implementations of Dolma and DataTrove (note: RedPajama V2 does not implement these two quality
         signals), we have made the following decisions:
         ensures consistency with the overall document character count calculation.
         """),
         H5("Our Implementation"),
+        Details(
+            Summary("TxT360 Implementation"),
+            D_code("""
+            words = text.split()
+            word_count = len(words)
+            character_count = sum(len(word) for word in words)
+            ...
+            lines = text.split("\n")
+            line_count = len(lines)
+            line_counts = Counter(lines)
+            attrs.fraction_of_duplicate_lines = (
+                sum((count - 1) for line, count in line_counts.items() if count > 1) / line_count
+            )
+            attrs.fraction_of_characters_in_duplicate_lines = (
+                sum(sum(len(w) for w in line.split()) * (count - 1) for line, count in
+                line_counts.items() if count > 1) / character_count
+            )
+            """, block="block", language="python"),
+        ),
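The Dolma and TxT360 excerpts above differ mainly in whether the first occurrence of a repeated line counts as a duplicate, and in whether whitespace inside the line is counted. The following is a minimal sketch of that difference, assuming a plain `text` string. The function names `dolma_style` and `txt360_style` and the helper `_line_stats` are hypothetical and do not appear in either codebase, and the `max(..., 1)` guards in `txt360_style` are an addition of this sketch.

```python
from collections import Counter


def _line_stats(text: str):
    """Shared bookkeeping: lines, their counts, and non-whitespace characters."""
    lines = text.split("\n")
    # Characters are counted per word, so whitespace is excluded.
    character_count = sum(len(word) for word in text.split())
    return lines, Counter(lines), character_count


def dolma_style(text: str) -> tuple[float, float]:
    """Every occurrence of a repeated line counts, including the first."""
    lines, counts, chars = _line_stats(text)
    dup_lines = sum(c for c in counts.values() if c > 1)
    # len(line) includes the whitespace inside the line.
    dup_chars = sum(len(line) * c for line, c in counts.items() if c > 1)
    return dup_lines / max(len(lines), 1), dup_chars / max(chars, 1)


def txt360_style(text: str) -> tuple[float, float]:
    """Only occurrences after the first count, and whitespace is excluded."""
    lines, counts, chars = _line_stats(text)
    dup_lines = sum(c - 1 for c in counts.values() if c > 1)
    dup_chars = sum(
        sum(len(w) for w in line.split()) * (c - 1)
        for line, c in counts.items()
        if c > 1
    )
    # Guards against empty input are an assumption of this sketch.
    return dup_lines / max(len(lines), 1), dup_chars / max(chars, 1)


sample = "buy now\nbuy now\nbuy now\nreal content here"
print("Dolma-style: ", dolma_style(sample))
print("TxT360-style:", txt360_style(sample))
```

On a document with one line repeated three times, the first variant treats all three copies as duplicates while the second treats only two of them as duplicates, so the second is always the smaller of the two fractions.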
         Details(
             Summary("Sample documents filtered by excessive line repetitions / characters in repeated lines"),
             DV(
         Following Gopher [2], we remove documents with a high portion of n-grams. For each n ∈ (2, 3, 4), we calculate the
         fraction of characters contained within the most frequently-occurring n-gram.
         """),
+        Details(
+            Summary("Implementations from Dolma"),
+            D_code("""
+            def all_ngram_counts(words) -> List[Tuple[int, CounterType[Tuple[str, ...]]]]:
+                return [(n, Counter(list(zip(*[words[i:] for i in range(n)])))) for n in range(2, 11)]
+            ...
+            all_counts = all_ngram_counts(words)
+            count_most_common_ngrams = (2, 3, 4)
+            for n, ngram_counts in all_counts:
+                if not ngram_counts:
+                    continue
+                if n in count_most_common_ngrams:
+                    most_common_ngram, count = ngram_counts.most_common(1)[0]
+                    value = count * sum(len(w) for w in most_common_ngram) / max(character_count, 1)
+                    attrs.fraction_of_characters_in_most_common_ngram.append((n, value))
+            """, block="block", language="python"),
+        ),
+        Details(
+            Summary("Implementations from RedPajama-V2"),
+            D_code("""
+                class Base_RPS_Frac_Chars_In_Top_NGram(RPSBase):  # noqa
+                    ## Base class for calculating the fraction of characters in the top N-gram. This operates on the lower-cased, punctuation-removed content.
+                    NGRAM_SIZE: int = None
+                    __slots__ = []
+                    def __call__(self, document: Document) -> SignalType:
+                        if self.NGRAM_SIZE is None:
+                            raise NotImplementedError(
+                                "NGRAM_SIZE must be set in the subclass"
+                            )
+                        # get the most common ngram
+                        most_common_ngram = Counter(
+                            # fetch the ngrams from the document if they exist, otherwise
+                            # compute them
+                            getattr(document, f"norm_{self.NGRAM_SIZE}grams", None)
+                            or
+                            form_ngrams(iter(document.normalized_words), self.NGRAM_SIZE)
+                        ).most_common(1)
+                        if len(most_common_ngram) == 0:
+                            return [(0, len(document), 0.0)]
+                        ngram, count = most_common_ngram[0]
+                        if count <= 1:
+                            return [(0, len(document), 0.0)]
+                        total_chars = sum(len(w) for w in document.normalized_words)
+                        score = sum(len(w) for w in ngram) * count / total_chars
+                        score = round(score, PRECISION)
+                        return [(0, len(document), score)]
+            """, block="block", language="python"),
+        ),
+        Details(
+            Summary("Implementations from DataTrove"),
+            D_code("""
+            def get_n_grams(words: list[str], n: int) -> list[str]:
+                return [" ".join(words[i : i + n]) for i in range(len(words) - n + 1)]
+            def find_top_duplicate(x: list[str]) -> int:
+                counter = Counter()
+                for element in x:
+                    counter[element] += 1
+                top_n_gram = counter.most_common(1)[0]
+                return len(top_n_gram[0]) * top_n_gram[1]
+            ...
+            for n, n_frac in self.top_n_grams:
+                n_grams = get_n_grams(words, n)
+                if not n_grams:
+                    continue
+                top_char_length = find_top_duplicate(n_grams)
+                if top_char_length / len(text) > n_frac:
+                    return False, f"top_n_gram"
+            """, block="block", language="python"),
+        ),
         P("""
         There are almost no contradictions between the above implementations of the fraction of characters in the most common
         n-gram. The main process involves counting the occurrences of each n-gram and selecting the most common one. The
         In practice, documents affected by this rule — where the most common n-gram exceeds a given threshold and occurs
         only once — tend to be short.
         """),
+        Details(
+            Summary("TxT360 Implementation"),
+            D_code("""
+            def all_ngram_counts_new(words) -> List[Tuple[int, CounterType[Tuple[str, ...]]]]:
+                return [(n, list(zip(*[words[i:] for i in range(n)]))) for n in range(2, 11)]
+            ...
+            all_counts = all_ngram_counts_new(words)
+            count_most_common_ngrams = (2, 3, 4)
+            for n, ngram_counts in all_counts:
+                if not ngram_counts:
+                    continue
+                if n in count_most_common_ngrams:
+                    most_common_ngram, count = Counter(ngram_counts).most_common(1)[0]
+                    value = count * sum(len(w) for w in most_common_ngram) / character_count
+                    attrs.fraction_of_characters_in_most_common_ngram.append((n, value))
+            """, block="block", language="python"),
+        ),
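As a self-contained illustration of the most-common n-gram signal described above, the sketch below counts word n-grams with a sliding window and divides the characters of the most frequent one (times its count) by the number of non-whitespace characters. The function name `top_ngram_char_fraction` is hypothetical, and the early return for empty input is an assumption of this sketch rather than part of the excerpts.

```python
from collections import Counter


def top_ngram_char_fraction(text: str, n: int) -> float:
    """Fraction of word characters covered by the most frequent word n-gram."""
    words = text.split()
    character_count = sum(len(w) for w in words)
    # Sliding window of n consecutive words, as in the zip-based excerpts.
    ngrams = list(zip(*[words[i:] for i in range(n)]))
    if not ngrams or character_count == 0:
        return 0.0
    most_common_ngram, count = Counter(ngrams).most_common(1)[0]
    return count * sum(len(w) for w in most_common_ngram) / character_count


sample = "click here click here click here to read more"
for n in (2, 3, 4):
    print(n, round(top_ngram_char_fraction(sample, n), 4))
```

Because occurrences of the top n-gram may overlap, the numerator can count the same characters more than once; this is consistent with the excerpts, which make no overlap correction for this signal.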
         Details(
             Summary("Sample documents filtered by the fraction of characters in the most common n-grams (n=2,3,4)"),
             DV(
         fraction of characters contained within all duplicate n-grams, taking care not to count characters that occur in
         overlapping n-grams more than once.
         """),
+        Details(
+            Summary("Implementations from Dolma"),
+            D_code("""
+            def all_ngram_counts(words) -> List[Tuple[int, CounterType[Tuple[str, ...]]]]:
+                return [(n, Counter(list(zip(*[words[i:] for i in range(n)])))) for n in range(2, 11)]
+            ...
+            all_counts = all_ngram_counts(words)
+            for n, ngram_counts in all_counts:
+                if not ngram_counts:
+                    continue
+                if n in count_most_common_ngrams:
+                    ...
+                else:
+                    ng_char_count = sum(count * sum(len(w) for w in ng) for ng, count in ngram_counts.items())
+                    value = sum(
+                        count * sum(len(w) for w in ng) for ng, count in ngram_counts.items() if count > 1
+                    ) / max(ng_char_count, 1)
+                    attrs.fraction_of_characters_in_duplicate_ngrams.append((n, value))
+            """, block="block", language="python"),
+        ),
+        Details(
+            Summary("Implementations from RedPajama-V2"),
+            D_code("""
+            class Base_RPS_Frac_Chars_In_Dupe_NGrams(RPSBase):  # noqa
+                ## Base class for calculating the fraction of characters in duplicate word N-grams. This operates on the lower-cased, punctuation-removed content. The function also ensures that characters in overlapping ngrams are only counted once.
+                NGRAM_SIZE: int = None
+                __slots__ = []
+                def __call__(self, document: Document) -> SignalType:
+                    if self.NGRAM_SIZE is None:
+                        raise NotImplementedError(
+                            "NGRAM_SIZE must be set in the subclass"
+                        )
+                    if len(document.normalized_words) < self.NGRAM_SIZE:
+                        return [(0, len(document), 0.0)]
+                    # fetch the ngrams from the document if they exist, otherwise
+                    # compute them
+                    doc_n_grams = (
+                            getattr(document, f"norm_{self.NGRAM_SIZE}grams", None)
+                            or
+                            tuple(form_ngrams(
+                                iter(document.normalized_words), self.NGRAM_SIZE
+                            ))
+                    )
+                    # keep only ngrams which occur at least twice
+                    ngram_dupes = {
+                        ngram for ngram, count in Counter(doc_n_grams).items() if count > 1
+                    }
+                    duplicated_grams = np.zeros(len(document.normalized_words), dtype=int)
+                    i = 0
+                    for ngram in doc_n_grams:
+                        if ngram in ngram_dupes:
+                            duplicated_grams[i: i + self.NGRAM_SIZE] = 1
+                        i += 1
+                    word_lengths = np.array(list(map(len, document.normalized_words)))
+                    chars_duped = np.sum(word_lengths * duplicated_grams)
+                    total_chars = np.sum(word_lengths)
+                    if total_chars == 0:
+                        return [(0, len(document), 0.0)]
+                    score = float(chars_duped / total_chars)
+                    score = round(score, PRECISION)
+                    return [(0, len(document), score)]
+            """, block="block", language="python"),
+        ),
+        Details(
+            Summary("Implementations from DataTrove"),
+            D_code("""
+            def find_all_duplicate(words: list[str], n: int) -> int:
+                n_words = len(words)
+                unique = set()
+                repeated_chars, idx = 0, 0
+                while idx < n_words - n + 1:
+                    n_gram = "".join(words[idx : idx + n])
+                    if n_gram in unique:
+                        repeated_chars += len(n_gram)
+                        idx += n
+                    else:
+                        unique.add(n_gram)
+                        idx += 1
+                assert repeated_chars <= len("".join(words))
+                return repeated_chars
+            ...
+            for n, n_frac in self.dup_n_grams:
+                n_duplicates_char = find_all_duplicate(words, n)
+                if n_duplicates_char / len(text) > n_frac:
+                    return False, f"duplicated_n_grams"
+            """, block="block", language="python"),
+        ),
         P("""
         For the computation of the fraction of characters in duplicate n-grams, Dolma uses the number of characters in all
         n-grams (with overlapping) as the denominator, and uses the number of characters in all duplicated n-grams
+        (with overlapping) as the numerator."""),
+        P("""RedPajama V2 uses the number of all characters in (the words of) the document
         (without overlapping) as the denominator, and uses the number of characters that are recognized as part of the
+        duplicate n-gram as the numerator."""),
+        P("""DataTrove uses the number of all characters in the document (including white
         spaces, without overlapping) as the denominator, and uses the number of characters that are recognized as
         duplicate n-gram as the numerator. However, there is a mismatch in DataTrove’s calculation, as the number of
         characters in the duplicated n-grams excludes white spaces, while the total character count of the document
+        does not."""),
+        P("""We decided to use the RedPajama V2 implementation but skip the first occurrence of each duplicate n-gram.
         """),
+        Details(
+            Summary("TxT360 Implementation"),
+            D_code("""
+            def get_dup_ngram_frac(n, doc_n_grams, text):
+                # fetch the ngrams from the document if they exist, otherwise compute them
+                # doc_n_grams = list(zip(*[words[i:] for i in range(n)]))
+                duplicated_grams = np.zeros(len(text.split()), dtype=int)
+                unique_ngrams = set()
+                for i, ngram in enumerate(doc_n_grams):
+                    if ngram in unique_ngrams:
+                        duplicated_grams[i: i + n] = 1
+                    else:
+                        unique_ngrams.add(ngram)
+                word_lengths = np.array(list(map(len, text.split())))
+                chars_duped = np.sum(word_lengths * duplicated_grams)
+                total_chars = np.sum(word_lengths)
+                return float(chars_duped / total_chars)
+            def all_ngram_counts_new(words) -> List[Tuple[int, CounterType[Tuple[str, ...]]]]:
+                return [(n, list(zip(*[words[i:] for i in range(n)]))) for n in range(2, 11)]
+            ...
+            all_counts = all_ngram_counts_new(words)
+            count_most_common_ngrams = (2, 3, 4)
+            for n, ngram_counts in all_counts:
+                if not ngram_counts:
+                    continue
+                if n in count_most_common_ngrams:
+                    ...
+                else:
+                    score = get_dup_ngram_frac(n, ngram_counts, text)
+                    attrs.fraction_of_characters_in_duplicate_ngrams.append((n, score))
+            """, block="block", language="python"),
+        ),
+        Details(
+            Summary("An example to show the difference between above implementations"),
+            P("""
+            Consider n = 5 and the sample sentence:
+            "word_a word_b word_c word_d word_e word_f word_g word_a word_b word_c word_d word_e word_f word_g word_a word_b word_c"
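+            In Dolma's implementation, there are 13 5-grams in total, and 12 of those occurrences belong to the 6 distinct 5-grams that appear more than once. Because the code shown above weights each duplicated 5-gram by its count, the resulting fraction of characters in duplicate 5-grams is 12/13.
+            In RedPajama V2's implementation, there are 17*6 characters in total, and every word lies inside at least one 5-gram that occurs twice, so all 17*6 characters are marked as duplicated. The fraction is 17/17.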
+            In DataTrove's implementation, there are 17*6 + 16 (white spaces) characters in total and 10 duplicated words (two non-overlapping repeated 5-grams) after excluding the first occurrence. The resulting fraction is 10*6/(17*6+16).
+            In our implementation, there are 17*6 characters in total, with 10*6 characters that are duplicated after excluding the first occurrence. This results in a fraction of 10/17.
+            """),
+        ),
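The four counting schemes can be checked by running simplified re-implementations on the sample sentence. The sketch below follows the logic of the excerpts above but is not the libraries' own code: the function names (`dolma_frac`, `redpajama_frac`, `datatrove_frac`, `txt360_frac`, `word_ngrams`) are hypothetical, whitespace splitting stands in for each library's own tokenization and normalization, and plain Python lists replace the NumPy arrays.

```python
from collections import Counter

SAMPLE = (
    "word_a word_b word_c word_d word_e word_f word_g "
    "word_a word_b word_c word_d word_e word_f word_g "
    "word_a word_b word_c"
)


def word_ngrams(words, n):
    """All overlapping word n-grams, in document order."""
    return list(zip(*[words[i:] for i in range(n)]))


def dolma_frac(text, n):
    # Overlapping n-gram characters in both numerator and denominator.
    counts = Counter(word_ngrams(text.split(), n))
    total = sum(c * sum(len(w) for w in ng) for ng, c in counts.items())
    dup = sum(c * sum(len(w) for w in ng) for ng, c in counts.items() if c > 1)
    return dup / max(total, 1)


def redpajama_frac(text, n):
    # Mark every word that falls inside any n-gram seen at least twice.
    words = text.split()
    grams = word_ngrams(words, n)
    dupes = {ng for ng, c in Counter(grams).items() if c > 1}
    marked = [False] * len(words)
    for i, ng in enumerate(grams):
        if ng in dupes:
            marked[i:i + n] = [True] * n
    total = sum(len(w) for w in words)
    return sum(len(w) for w, m in zip(words, marked) if m) / max(total, 1)


def datatrove_frac(text, n):
    # Skip ahead by n words after a repeat; the denominator includes whitespace.
    words = text.split()
    seen, repeated, idx = set(), 0, 0
    while idx < len(words) - n + 1:
        gram = "".join(words[idx:idx + n])
        if gram in seen:
            repeated += len(gram)
            idx += n
        else:
            seen.add(gram)
            idx += 1
    return repeated / max(len(text), 1)


def txt360_frac(text, n):
    # Like the RedPajama V2 variant, but the first occurrence is not marked.
    words = text.split()
    seen = set()
    marked = [False] * len(words)
    for i, ng in enumerate(word_ngrams(words, n)):
        if ng in seen:
            marked[i:i + n] = [True] * n
        else:
            seen.add(ng)
    total = sum(len(w) for w in words)
    return sum(len(w) for w, m in zip(words, marked) if m) / max(total, 1)


for fn in (dolma_frac, redpajama_frac, datatrove_frac, txt360_frac):
    print(fn.__name__, round(fn(SAMPLE, 5), 4))
```

Printing the four values side by side makes the effect of each design choice visible: overlap handling, whether whitespace enters the denominator, and whether the first occurrence is counted.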
+        H4("
         H5(
             "Sample Documents Filtered by the Fraction of Characters in Duplicated N-grams (n=5,...,10)"
         ),
         works ([2], [3], [6]), we remove the documents if more than 30% of the lines end with an ellipsis or more than
         90% of lines start with a bullet point.
         """),
+        Details(
+            Summary("Ellipsis Symbol Identification Implementations"),
+            P("Dolma: "),
+            D_code("""
+            ELLIPSIS_SYMBOLS = ("…")
+            """, block="block", language="python"),
+            P("RedPajamaV2: "),
+            D_code("""
+            ELLIPSIS_SYMBOLS = ("...", "…")
+            """, block="block", language="python"),
+            P("DataTrove: "),
+            D_code("""
+            ELLIPSIS_SYMBOLS = ("...", "…")
+            """, block="block", language="python"),
+            P("TxT360: "),
+            D_code("""
+            ELLIPSIS_SYMBOLS = ("...", "…", "[...]", "[…]")
+            """, block="block", language="python"),
+        ),
+        Details(
+            Summary("Bullet Point Identification Implementations"),
+            P("Dolma: "),
+            D_code("""
+            BULLET_POINTS = ("*", "-")
+            """, block="block", language="python"),
+            P("RedPajamaV2: "),
+            D_code("""
+            BULLET_POINT_SYMBOLS = (
+                "•",  # bullet point
+                "‣",  # triangular bullet point
+                "▶",  # black right pointing triangle
+                "◀",  # black left pointing triangle
+                "◦",  # white bullet point
+                "■",  # black square
+                "□",  # white square
+                "▪",  # black small square
+                "▫",  # white small square
+                "–",  # en dash
+            )
+            """, block="block", language="python"),
+            P("DataTrove: "),
+            D_code("""
+            BULLET_POINT_SYMBOLS = ("•" , "-")
+            """, block="block", language="python"),
+            P("TxT360: "),
+            D_code("""
+            BULLET_POINT_SYMBOLS = (
+                "•",  # • bullet point
+                "‣",  # ‣ triangular bullet point
+                "▶",  # ▶ black right pointing triangle
+                "◀",  # ◀ black left pointing triangle
+                "◦",  # ◦ white bullet point
+                "■",  # ■ black square
+                "□",  # □ white square
+                "▪",  # ▪ black small square
+                "▫",  # ▫ white small square
+                "-",  # - hyphen-minus
+                "–",  # – en dash
+                "—",  # — em dash
+                "*",  # * star
+            )
+            """, block="block", language="python"),
+        ),
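The two line-wise rules (more than 30% of lines ending with an ellipsis, or more than 90% of lines starting with a bullet point) can be sketched with the TxT360 symbol lists shown above. This is an illustrative sketch only: the function names `line_fractions` and `keep_document` are hypothetical, and ignoring blank lines and stripping each line before matching are assumptions that the excerpts do not confirm.

```python
ELLIPSIS_SYMBOLS = ("...", "…", "[...]", "[…]")
BULLET_POINT_SYMBOLS = (
    "•", "‣", "▶", "◀", "◦", "■", "□", "▪", "▫", "-", "–", "—", "*",
)


def line_fractions(text: str) -> tuple[float, float]:
    """Return (fraction of lines ending with an ellipsis, fraction starting with a bullet)."""
    # Assumption: blank lines are ignored and lines are stripped before matching.
    lines = [ln.strip() for ln in text.split("\n") if ln.strip()]
    if not lines:
        return 0.0, 0.0
    # str.endswith / str.startswith accept a tuple of candidate strings.
    ellipsis = sum(ln.endswith(ELLIPSIS_SYMBOLS) for ln in lines)
    bullets = sum(ln.startswith(BULLET_POINT_SYMBOLS) for ln in lines)
    return ellipsis / len(lines), bullets / len(lines)


def keep_document(text: str, max_ellipsis: float = 0.3, max_bullet: float = 0.9) -> bool:
    """Keep the document unless either fraction exceeds its threshold."""
    ellipsis_frac, bullet_frac = line_fractions(text)
    return ellipsis_frac <= max_ellipsis and bullet_frac <= max_bullet


doc = "- item one\n- item two\n* item three\nRead more..."
print(line_fractions(doc), keep_document(doc))
```

The width of the symbol list matters here: with Dolma's two bullet symbols or DataTrove's two, a list written with "▪" or "–" would not be recognized as bulleted at all.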
         Details(
             Summary("Sample documents that are filtered out by line-wise heuristics"),
             DV(
                 "Sample documents that are filtered out by line-wise heuristics",
             ),
         ),
         H4("3.3 Statistics-based Heuristics"),
         P("We summarize other statistics-based rules that originated from Gopher [7] in this section. The statistics that can be used include:"),
         Ul(
         Details(
             Summary("Implementations from Dolma"),
             D_code("""
+            words = text.split()
+            word_count = len(words)
             """, block="block", language="python"),
         ),
         Details(
             Summary("Implementations from RedPajama-V2"),
             D_code("""
+            # the normalized content: lowercased and punctuation removed
+            self._normalized_content = normalize(content)
+            self._normalized_words = tuple(self._normalized_content.split())
+            self._num_normalized_words = len(self._normalized_words)
+            ...
+            def normalize(
+                   text: str,
+                   remove_punct: bool = True,
+                   lowercase: bool = True,
+                   nfd_unicode: bool = True,
+                   white_space: bool = True
+            ) -> str:
+               #Normalize the text by lowercasing and removing punctuation.
+               # remove punctuation
+               if remove_punct:
+                   text = text.translate(TRANSLATION_TABLE_PUNCTUATION)
+               # lowercase
+               if lowercase:
+                   text = text.lower()
+               if white_space:
+                   text = text.strip()
+                   text = re.sub(r"\s+", " ", text)
+               # NFD unicode normalization
+               if nfd_unicode:
+                   text = unicodedata.normalize("NFD", text)
+               return text
             """, block="block", language="python"),
         ),
         Details(
             Summary("Implementations from DataTrove"),
             D_code("""
+            words = self.tokenizer.word_tokenize(text)
+            n_words = len(words)
+            non_symbol_words = [w for w in words if any(ch not in PUNCTUATION_SET for ch in w)]
+            n_non_symbol_words_words = len(non_symbol_words)
             """, block="block", language="python"),
         ),
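The three word-count excerpts above can disagree on the same text because they tokenize and filter differently. The sketch below is a rough approximation under stated assumptions: `PUNCT_TABLE` and `PUNCTUATION_SET` are built from Python's `string.punctuation` as stand-ins for the libraries' own tables (which may differ), whitespace splitting stands in for DataTrove's word tokenizer, and all three function names are hypothetical.

```python
import string

# Assumption: stand-ins for the libraries' own punctuation tables.
PUNCT_TABLE = str.maketrans("", "", string.punctuation)
PUNCTUATION_SET = set(string.punctuation)


def whitespace_word_count(text: str) -> int:
    """Dolma-style: split on whitespace and keep every token."""
    return len(text.split())


def normalized_word_count(text: str) -> int:
    """RedPajama-V2-style: remove punctuation and lowercase before splitting."""
    return len(text.translate(PUNCT_TABLE).lower().split())


def non_symbol_word_count(text: str) -> int:
    """DataTrove-style filter: drop tokens that consist only of punctuation."""
    # Assumption: whitespace splitting replaces the real word tokenizer.
    return sum(
        1 for w in text.split() if any(ch not in PUNCTUATION_SET for ch in w)
    )


sample = "Hello , world !!! -- it's 5 p.m."
print(
    whitespace_word_count(sample),
    normalized_word_count(sample),
    non_symbol_word_count(sample),
)
```

Tokens made up entirely of punctuation inflate the whitespace-based count, which in turn lowers any ratio that uses the word count as its denominator.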
         P("""
         Details(
             Summary("Implementations from RedPajama-V2"),
             D_code("""
+            class RPS_Doc_Num_Sentences(RPSBase):  # noqa
+                ## The number of sentences in the content. This is calculated using the regex r'[^.!?]+[.!?]*'
+                SENT_PATTERN = re.compile(r'[^.!?]+[.!?]*', flags=re.UNICODE)
+                __slots__ = ()
+                def __call__(self, document: Document) -> SignalType:
+                    ## count the number of sentences in the content using regex
+                    score = float(len(self.SENT_PATTERN.findall(document.raw_content)))
+                    return [(0, len(document), score)]
             """, block="block", language="python"),
         ),
         P("""
         Details(
             Summary("TxT360 Implementation"),
             D_code("""
+            from nltk.tokenize import sent_tokenize
+            ...
+            def count_sentences(text):
+                sentences = sent_tokenize(text)
+                return len(sentences)
+            ...
+            attrs.num_of_sentences = count_sentences(text)
             """, block="block", language="python"),
         ),
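To see how the regex-based count from the RedPajama-V2 excerpt behaves, the sketch below applies the same pattern to two short strings. Only the pattern itself comes from the excerpt; the function name `regex_sentence_count` is hypothetical. The NLTK-based count is not reproduced here because `sent_tokenize` needs tokenizer data to be downloaded first.

```python
import re

# Same pattern as in the RedPajama-V2 excerpt.
SENT_PATTERN = re.compile(r"[^.!?]+[.!?]*", flags=re.UNICODE)


def regex_sentence_count(text: str) -> int:
    """Count runs of non-terminator characters, each with its trailing terminators."""
    return len(SENT_PATTERN.findall(text))


plain = "It works. Does it? Yes!"
tricky = "Dr. Smith paid $3.50 for coffee."
print(regex_sentence_count(plain))
print(regex_sentence_count(tricky))
```

The second string is a single sentence, yet the pattern splits at every period, so abbreviations and decimal numbers each add to the count. A trained sentence tokenizer is one way to reduce this kind of over-counting.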
         Details(
             Summary("Implementations from Dolma"),
             D_code("""
+            SYMBOLS = ("#", "…")
+            ...
+            attrs.symbol_to_word_ratio = sum(1 for word in words if any(s in word for s in SYMBOLS)) / max(
+                        word_count, 1
+                    )
             """, block="block", language="python"),
         ),
         Details(
             Summary("Implementations from RedPajama-V2"),
             D_code("""
+            class RPS_Doc_Symbol_To_Word_Ratio(RPSBase):  # noqa
+                ## The ratio of symbols to words in the content. This is analogous to
+                ## the signal used in Gopher. Symbols are defined as "#", "...", and "…".
+                SYMBOLS = ("#", "...", "…")
+                __slots__ = ()
+                def __call__(self, document: Document) -> SignalType:
+                    num_words = document.num_raw_words
+                    if num_words == 0:
+                        return [(0, len(document), None)]
+                    # count the number of symbols in the content
+                    num_symbols = float(sum(
+                        document.raw_content.count(x) for x in self.SYMBOLS
+                    ))
+                    score = num_symbols / num_words
+                    score = round(score, PRECISION)
+                    return [(0, len(document), score)]
             """, block="block", language="python"),
         ),
         Details(
             Summary("Implementations from DataTrove"),
             D_code("""
+            if self.max_symbol_word_ratio and text.count("#") / n_words > self.max_symbol_word_ratio:
+                return False, "gopher_too_many_hashes"
+            if self.max_symbol_word_ratio and (text.count("...") + text.count("…")) / n_words > self.max_symbol_word_ratio:
+                return False, "gopher_too_many_ellipsis"
             """, block="block", language="python"),
         ),
         Details(
             Summary("TxT360 Implementation"),
             D_code("""
+            SYMBOLS = ("#", "...", "…")
+            ...
+            symbol_pattern = re.compile("|".join(re.escape(symbol) for symbol in SYMBOLS))
+            ...
+            attrs.symbol_to_word_ratio = sum(1 for word in words if symbol_pattern.search(word)) / word_count
             """, block="block", language="python"),
         ),
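The symbol-to-word excerpts above fall into two groups: Dolma and TxT360 count words that contain a symbol, while RedPajama-V2 and DataTrove count symbol occurrences in the raw text. The sketch below contrasts the two on one string using the TxT360 symbol list. The function names are hypothetical, and the `max(..., 1)` guard is a simplification (the RedPajama-V2 excerpt returns `None` for documents with no words).

```python
import re

SYMBOLS = ("#", "...", "…")
SYMBOL_PATTERN = re.compile("|".join(re.escape(s) for s in SYMBOLS))


def word_based_ratio(text: str) -> float:
    """Count words that contain at least one symbol, divided by the word count."""
    words = text.split()
    return sum(1 for w in words if SYMBOL_PATTERN.search(w)) / max(len(words), 1)


def occurrence_based_ratio(text: str) -> float:
    """Count every symbol occurrence in the raw text, divided by the word count."""
    words = text.split()
    return sum(text.count(s) for s in SYMBOLS) / max(len(words), 1)


sample = "### Title ... more text…"
print(word_based_ratio(sample), occurrence_based_ratio(sample))
```

A word such as `###` contributes once to the word-based ratio but three times to the occurrence-based one, so the occurrence-based ratio can exceed 1 while the word-based ratio cannot.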