tokenizer-arena/stats/compression_rate/ClueAI.ChatYuan-large-v2 @ cc100.en.diff.json
[
{
"text": "No extra costs for access? Asking for a disabled access hack if I want to take my chair (Quickie Ti - weighs little, I can just pick it up and put it in, no need for time-consuming ramps), to the pub here in Wirral jacks up the normal fair by about £1.50.",
"decoded_text": "No extra costs for access? Asking for a disabled access hack if I want to take my chair (Quickie Ti - weighs little, I can just pick it up and put it in, no need for time-consuming ramps), to the pub here in Wirral jacks up the normal fair by about <unk>1.50.",
"diff": [
"replace text[249:250] --> decoded_text[249:254] '£' --> '<unk>'"
],
"n_oov_chars": 1,
"oov_ratio": 0.00392156862745098,
"oov_charset": "[\"£\"]"
},
{
"text": "and yeah im a boy,and no, im not g*y, im a nice guy. i dont love his songs or anything , but he's not that bad tbh.",
"decoded_text": "and yeah im a boy,and no, im not g*y, im a nice guy. i dont love his songs or anything, but he's not that bad tbh.",
"diff": [
"delete text[86:87] --> decoded_text[86:86] ' ' --> ''"
],
"n_oov_chars": 0,
"oov_ratio": 0.0,
"oov_charset": "[]"
},
{
"text": "Justin serenaded wonderful or better than a great I like popular songs, particularly as it is talented. all those who hate Justin are g**s because they feel jealous of him because he is handsome at the same time a rising singer and a small age. I myself appreciate the wonderful artist with this beautiful and talented .",
"decoded_text": "Justin serenaded wonderful or better than a great I like popular songs, particularly as it is talented. all those who hate Justin are g**s because they feel jealous of him because he is handsome at the same time a rising singer and a small age. I myself appreciate the wonderful artist with this beautiful and talented.",
"diff": [
"delete text[318:319] --> decoded_text[318:318] ' ' --> ''"
],
"n_oov_chars": 0,
"oov_ratio": 0.0,
"oov_charset": "[]"
},
{
"text": "Soften the landing zones with a pair of Rubber Mats , made from dyed rubber chips, heat compressed and available in dark green or brick red.",
"decoded_text": "Soften the landing zones with a pair of Rubber Mats, made from dyed rubber chips, heat compressed and available in dark green or brick red.",
"diff": [
"delete text[51:52] --> decoded_text[51:51] ' ' --> ''"
],
"n_oov_chars": 0,
"oov_ratio": 0.0,
"oov_charset": "[]"
},
{
"text": "​EEI Members have access to a wide range of reports, publications, communications, and other resources. In order to access the resources below, a member log in is required.",
"decoded_text": "EEI Members have access to a wide range of reports, publications, communications, and other resources. In order to access the resources below, a member log in is required.",
"diff": [
"delete text[0:1] --> decoded_text[0:0] '\\u200b' --> ''"
],
"n_oov_chars": 1,
"oov_ratio": 0.005813953488372093,
"oov_charset": "[\"​\"]"
},
{
"text": "​Launched in 2017, AUPSE is a senior executive knowledge exchange and peer-to-peer networking platform created to accelerate operational excellence in the African electric power sector.",
"decoded_text": "Launched in 2017, AUPSE is a senior executive knowledge exchange and peer-to-peer networking platform created to accelerate operational excellence in the African electric power sector.",
"diff": [
"delete text[0:1] --> decoded_text[0:0] '\\u200b' --> ''"
],
"n_oov_chars": 1,
"oov_ratio": 0.005405405405405406,
"oov_charset": "[\"​\"]"
},
{
"text": "Would love some tatts, but too much of a wimp to get them! 😥",
"decoded_text": "Would love some tatts, but too much of a wimp to get them! <unk>",
"diff": [
"replace text[59:60] --> decoded_text[59:64] '😥' --> '<unk>'"
],
"n_oov_chars": 1,
"oov_ratio": 0.016666666666666666,
"oov_charset": "[\"😥\"]"
},
{
"text": "We're not so rough and over the top these days, so they miiiiight survive ._.",
"decoded_text": "We're not so rough and over the top these days, so they miiiiight survive._.",
"diff": [
"delete text[73:74] --> decoded_text[73:73] ' ' --> ''"
],
"n_oov_chars": 0,
"oov_ratio": 0.0,
"oov_charset": "[]"
},
{
"text": "Just finished Hulse's \"Black River\" and simply adored the book. So pretty, overall, and much like the Kent Haruf novels, such as \"Plainsong\" that I've enjoyed over the years. \"Black River\" is surely one of the best five I've read this year. Solid Pulitzer choice, in my opinion. Side note: As I've mentioned before, I surely don't understand all of the hoopla surrounding \"The Sellout,\" with so many other worthy contenders. But, what do I know? I'm only a reader. :-) Read on ...",
"decoded_text": "Just finished Hulse's \"Black River\" and simply adored the book. So pretty, overall, and much like the Kent Haruf novels, such as \"Plainsong\" that I've enjoyed over the years. \"Black River\" is surely one of the best five I've read this year. Solid Pulitzer choice, in my opinion. Side note: As I've mentioned before, I surely don't understand all of the hoopla surrounding \"The Sellout,\" with so many other worthy contenders. But, what do I know? I'm only a reader. :-) Read on...",
"diff": [
"replace text[476:480] --> decoded_text[476:479] ' ...' --> '...'"
],
"n_oov_chars": 0,
"oov_ratio": 0.0,
"oov_charset": "[]"
},
{
"text": "I really don't understand all of the hoopla over THE SELLOUT. Just a so-so book, in my opinion. Minor work. I struggled through it, and can never get back the time spent on that tome. EILEEN and HONEYDEW are sooooooo much better, not to mention THE TURNER HOUSE, TSAR, DID YOU EVER, and others. I'm reading DELICIOUS FOODS right now, and think it's a major-serious contender as well. BLACK RIVER is next on my list, and I can't wait. But, what do I know? :-) Read on ...",
"decoded_text": "I really don't understand all of the hoopla over THE SELLOUT. Just a so-so book, in my opinion. Minor work. I struggled through it, and can never get back the time spent on that tome. EILEEN and HONEYDEW are sooooooo much better, not to mention THE TURNER HOUSE, TSAR, DID YOU EVER, and others. I'm reading DELICIOUS FOODS right now, and think it's a major-serious contender as well. BLACK RIVER is next on my list, and I can't wait. But, what do I know? :-) Read on...",
"diff": [
"replace text[466:470] --> decoded_text[466:469] ' ...' --> '...'"
],
"n_oov_chars": 0,
"oov_ratio": 0.0,
"oov_charset": "[]"
}
]