Code problem: Where is the code for tokenizing in the preprocessing of pre-training data?
I read the code and found it to be of very high quality, so I wanted to use it on our experimental data, but I ran into some problems. There is no tokenizing code in Geneformer/examples/pretraining_new_model/obtain_nonzero_median_digests.ipynb. Later I found some in Geneformer/examples/tokenizing_scRNAseq_data.ipynb, but after reading it I felt it did not match the task: tokenizing requires a loom file, whereas the first notebook generates a pickle file. The tokenizing code for pretraining must be somewhere else that I haven't found yet. Where is the code for tokenizing the pretraining data?
Thanks for your question! The code for tokenizing is the one you mentioned: tokenizing_scRNAseq_data.ipynb
Importantly, tokenizing data for pretraining and for downstream tasks is the same process.
The other file you mentioned generates the pickle (.pkl) file of gene median values that the tokenizer uses. You only need to generate it yourself if you have a new corpus. We provide the one for our token dictionary / corpus.
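To make the relationship between the two files concrete, here is a minimal sketch of how a pickled median dictionary might feed into rank-based tokenizing of a single cell. It is an illustration, not the code in the repository. The function name `rank_value_encode`, the gene IDs, the token IDs, and the 10,000-count scaling target are all assumptions made for the example.

```python
import pickle

import numpy as np


def rank_value_encode(counts, gene_ids, gene_median, token_dict, target_sum=10_000):
    """Hypothetical helper: order one cell's genes by median-normalized expression.

    counts      -- raw counts for one cell, aligned with gene_ids
    gene_median -- dict gene ID -> nonzero median (the content of the .pkl file)
    token_dict  -- dict gene ID -> integer token
    """
    counts = np.asarray(counts, dtype=float)
    # Keep only detected genes that have both a median and a token.
    keep = [
        i for i, g in enumerate(gene_ids)
        if counts[i] > 0 and g in gene_median and g in token_dict
    ]
    if not keep:
        return []
    medians = np.array([gene_median[gene_ids[i]] for i in keep])
    # Scale by the cell's total counts, then divide by each gene's median.
    norm = counts[keep] / counts.sum() * target_sum / medians
    # Highest normalized value first; a stable sort keeps ties in input order.
    order = np.argsort(-norm, kind="stable")
    return [token_dict[gene_ids[keep[i]]] for i in order]


# Toy data (made up): the median dictionary is round-tripped through pickle,
# standing in for the .pkl file produced by the median notebook.
gene_ids = ["GENE_A", "GENE_B", "GENE_C", "GENE_D"]
median_bytes = pickle.dumps({"GENE_A": 10.0, "GENE_B": 1.0, "GENE_C": 1.0, "GENE_D": 2.0})
gene_median = pickle.loads(median_bytes)
token_dict = {"GENE_A": 2, "GENE_B": 3, "GENE_C": 4, "GENE_D": 5}

tokens = rank_value_encode([10, 0, 5, 5], gene_ids, gene_median, token_dict)
print(tokens)  # [4, 5, 2]
```

In this sketch the pickle is only an input to tokenizing (the per-gene medians), while the expression data itself would come from the loom file. That is why the two notebooks produce and consume different file types.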