pretokenizer Regex issues?

#278
by hpcpony - opened

I'm curious about the pre-tokenizer in tokenizer.json:

  "pre_tokenizer": {
    "type": "Sequence",
    "pretokenizers": [
      {
        "type": "Split",
        "pattern": {
          "Regex": " ?[^(\\s|[.,!?…。,、।۔،])]+"
        },
        "behavior": "Isolated",
        "invert": false
      },
      {
        "type": "ByteLevel",
        "add_prefix_space": false,
        "trim_offsets": true,

If I try to optimum-cli export openvino ... it outputs an error message saying it can't parse the Regex. I do still get some sort of converted model, but I haven't gotten to the point of trying to use it. optimum-cli export onnx ... does not complain (though I've got other onnx issues).

If I take the Regex and paste it into various regex validators on the web, many of them flag it as invalid. (regex101.com will accept it, but only as Java or Rust flavor.)

I don't claim to be a regex expert, but looking at the pattern and at what the various validators say it should do, it doesn't really look like it does what I would expect it to.

So, I'm just wondering... Is it really a valid regex pattern? What is it supposed to be matching? Is there a simpler (valid) regex pattern that would do the same thing? How can I tell whether it's even used, or whether I can just replace it with some benign pattern that's valid?
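For what it's worth, here's roughly how I'd try feeding the pattern to the tokenizers library directly, just to see what it matches (a quick sketch; I'm assuming Split plus Regex is the right way to exercise it on its own):

from tokenizers import Regex, pre_tokenizers

# Same pattern as in tokenizer.json; does tokenizers itself accept it, and how does it split?
split = pre_tokenizers.Split(Regex(r" ?[^(\s|[.,!?…。,、।۔،])]+"), "isolated")
print(split.pre_tokenize_str("Hello, world… is this even valid?"))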

Thanks.

BigScience Workshop org

The pre-tokenizer works as intended; you can test it like this:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom")
print(tokenizer.backend_tokenizer.pre_tokenizer.pre_tokenize_str("Random, test of: the tokenizer with some(?) random text!"))

[('Random', (0, 6)), (',', (6, 7)), ('Ġtest', (7, 12)), ('Ġof:', (12, 16)), ('Ġthe', (16, 20)), ('Ġtokenizer', (20, 30)), ('Ġwith', (30, 35)), ('Ġsome', (35, 40)), ('(?)', (40, 43)), ('Ġrandom', (43, 50)), ('Ġtext', (50, 55)), ('!', (55, 56))]
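To see that the custom Split is actually doing something (rather than being safely replaceable), you can compare it against a plain byte-level pre-tokenizer on the same text. A rough sketch, assuming transformers and tokenizers are installed:

from tokenizers import pre_tokenizers
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom")
text = "Random, test of: the tokenizer with some(?) random text!"

# BLOOM's pre-tokenizer: Split(custom regex) followed by ByteLevel
print(tokenizer.backend_tokenizer.pre_tokenizer.pre_tokenize_str(text))

# A default GPT-2-style ByteLevel pre-tokenizer for comparison; it separates
# runs of punctuation from letters, so e.g. "of:" comes out as two pieces
print(pre_tokenizers.ByteLevel(add_prefix_space=False).pre_tokenize_str(text))

So swapping the pattern for some other "benign" one would generally change the resulting splits, and therefore the token ids.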
BigScience Workshop org

See page 18 of the BLOOM paper (section Pre-tokenizer) for more background on the design decisions.

It was implemented like this, if you want to reproduce it with tokenizers:

from tokenizers import Regex, pre_tokenizers

# tokenizer here is the underlying tokenizers.Tokenizer (e.g. tokenizer.backend_tokenizer)
tokenizer.pre_tokenizer = pre_tokenizers.Sequence(
    [
        pre_tokenizers.Split(Regex(r" ?[^(\s|[.,!?…。,、।۔،])]+"), "isolated"),
        pre_tokenizers.ByteLevel(add_prefix_space=False, use_regex=False),
    ]
)
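To double-check the rebuilt pre-tokenizer, you can install it on the backend tokenizer and compare with the output earlier in the thread. A quick sketch (assumes a tokenizers version where ByteLevel exposes use_regex):

from tokenizers import Regex, pre_tokenizers
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom")
backend = tokenizer.backend_tokenizer

# Rebuild the Sequence exactly as above and install it on the backend tokenizer
backend.pre_tokenizer = pre_tokenizers.Sequence(
    [
        pre_tokenizers.Split(Regex(r" ?[^(\s|[.,!?…。,、।۔،])]+"), "isolated"),
        pre_tokenizers.ByteLevel(add_prefix_space=False, use_regex=False),
    ]
)

# Should produce the same pieces as the pre_tokenize_str example above
print(backend.pre_tokenizer.pre_tokenize_str("Random, test of: the tokenizer with some(?) random text!"))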

Not sure why it fails with OpenVINO, but I'd search the issues in the tokenizers repo to see if others have run into this before.

Ok, thanks.

hpcpony changed discussion status to closed
