Dataset question regarding eos

#2
by chrisgru - opened

Hi,
First, thank you for everything!
I have a question since I'm experimenting and debugging a lot.
Do you also see double eos tokens being added to the data when training?
You can check using:
python -m axolotl.cli.train your_config.yml --prepare_ds_only --debug --debug_text_only --debug_num_examples 2

For a simple dataset, I get this (eos, then \n, then eos again):
<|im_start|>user
Bonjour!<|im_end|>

<|im_start|>assistant
Bonjour!<|im_end|>

<|im_end|><|im_start|>user
Salutations!<|im_end|>

<|im_start|>assistant
Salutations!<|im_end|>

<|im_end|>

The dataset looks like this:
{"conversations": [{"from": "human", "value": "Bonjour!"}, {"from": "gpt", "value": "Bonjour!"}]}
{"conversations": [{"from": "human", "value": "Salutations!"}, {"from": "gpt", "value": "Salutations!"}]}

I assume since these are 2 eos tokens separated by \n, it doesn't matter for performance in the end, but I just wanted to run this by you.
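
If you want a programmatic check on top of the debug output, something like this works (just a rough sketch; the prepared-dataset path and base model below are placeholders, not taken from this thread):

# Rough sketch: decode one prepared sample and eyeball the end of each turn.
# The path and model name are placeholders; use your own config values.
from datasets import load_from_disk
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("your-base-model")        # placeholder
ds = load_from_disk("last_run_prepared/your-prepared-dataset")      # placeholder for the dataset_prepared_path output

text = tokenizer.decode(ds[0]["input_ids"])
print(text)                              # look for <|im_end|> appearing twice at the end of a turn
print(text.count(tokenizer.eos_token))   # total eos occurrences in the first sample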

Cognitive Computations org

This is really interesting. I will examine this.

chrisgru changed discussion status to closed
chrisgru changed discussion status to open
Cognitive Computations org
edited Nov 15, 2023

I've spent half a day trying and failing to figure out why, the moment I loaded this model, my oobabooga web UI broke completely and couldn't follow the ChatML format anymore.
The main problem wasn't just this model: every other ChatML model stopped functioning no matter what I tried (short of a complete webui purge and reinstall), even though it had been working flawlessly with every ChatML model all day until I tried loading this one.
I still think it's largely due to bugs in the web UI, which is usually riddled with those, but maybe the tokenizer issues played a part too.
I'm re-downloading TheBloke's quants once more to test.

The 2 eos tokens that appear here should not bother Ooba. Test the model in another way if you can.
The 2 eos tokens are added:

  1. First as a separator: https://github.com/lm-sys/FastChat/blob/main/fastchat/conversation.py#L163
    ret += role + "\n" + message + self.sep + "\n"
  2. axolotl adds another eos:
    https://github.com/OpenAccess-AI-Collective/axolotl/blob/1a6309c8a633a6fe17b2ffebbbc0353565f376e5/src/axolotl/prompt_tokenizers.py#L392

# this should be the assistant response, should end with an eos token
...
res = self._tokenize(
    turn,
    add_eos_token=True,
    strip_bos_token=True,
)

And as such we have at the end of each conversation:
<|im_end|>\n<|im_end|>
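
For illustration, a tiny string-level sketch of those two steps (plain Python strings, not axolotl's actual tokenization path) reproduces the output shown at the top of the thread:

# Plain-string sketch of the two steps above; not axolotl's real code path.
sep = "<|im_end|>\n"   # separator used by the ChatML conversation template
eos = "<|im_end|>"     # eos token appended by axolotl's _tokenize(add_eos_token=True)

turn = "<|im_start|>assistant" + "\n" + "Bonjour!" + sep + "\n"  # step 1: role + "\n" + message + sep + "\n"
turn += eos                                                      # step 2: the extra eos from axolotl

print(turn)
# <|im_start|>assistant
# Bonjour!<|im_end|>
#
# <|im_end|>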

Cognitive Computations org

Is there something I need to change?

Cognitive Computations org

When I train Dolphin 3.0, I can modify the axolotl config to train it differently.

Hi Eric,

There is not much we can change at this point. We could submit a PR to axolotl and change the above self._tokenize(...) call to set add_eos_token to False. (Will this affect other templates that use the ShareGPTPromptTokenizingStrategy class? Maybe, which is why it needs a bit of attention. I'll try, but time is limited on my side, which is why there is no PR from me as of yet.)
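
For reference, the change would look roughly like this (a sketch of the proposed PR only, not merged code; whether it is safe for the other templates built on ShareGPTPromptTokenizingStrategy still needs checking):

# Sketch of the proposed change in prompt_tokenizers.py: let the conversation
# template provide the single <|im_end|> and stop appending a second eos here.
res = self._tokenize(
    turn,
    add_eos_token=False,   # was True
    strip_bos_token=True,
)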
We also need to create a new FastChat conversation class to remove the \n at the end, like this:

from fastchat.conversation import Conversation, SeparatorStyle, register_conv_template

register_conv_template(
    Conversation(
        name="chatml2",
        system_template="<|im_start|>system\n{system_message}",
        system_message="You are a helpful assistant.",
        roles=["<|im_start|>user", "<|im_start|>assistant"],
        sep_style=SeparatorStyle.CHATML,
        sep="<|im_end|>",
    )
)

(The original chatml conversation has sep="<|im_end|>\n", which is why FastChat and Axolotl both added a \n at the end.)
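
A quick way to sanity-check the new template once it is registered (a sketch; the exact rendered string depends on the FastChat and axolotl versions in use):

# Render a short conversation with the "chatml2" template defined above
# and inspect how each turn is terminated.
from fastchat.conversation import get_conv_template

conv = get_conv_template("chatml2")
conv.append_message(conv.roles[0], "Bonjour!")
conv.append_message(conv.roles[1], "Bonjour!")
print(repr(conv.get_prompt()))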

Once this is done, the template will be correct. All this happens because multi-turn conversations were not used that much previously, and I guess people did not notice.
This may or may not affect the currently trained Dolphin model. My not-yet-so-expert opinion is that it affects it a little bit.
