Can genes filtered before cell_state perturbation?

#439
by ZYSK-huggingface - opened

Hi Christina!
I was wondering if I could filtered genes according to their original expression in my raw data before cell state perturbation, beacuse perturbate 'all' genes cost too much time and space, so I wanted to reduce the genes range to do cell_state perturbation. I think I could easily filter a list of genes and obtain a gene list, but if I do not set the parameter ‘genes_to_perturbed’ to 'all' but a large list, or even few genes that I'm instered in, will it be any different from that of 'all' mode, expecially for FDR and pvalue?

Besides, I wandered if there is any differnence when set 'emb_mode' to 'cell' but not 'cell_and_gene'. At first I use 'cell_and_gene', and I found generated gene_emb_dict having no use in "goal_state_shift" mode. But when I use 'cell', beacuse the whole running program cost so much time, I simply test the intermidate file to obtain "goal_state_shift" result, and found almost all genes row(which means, I thought gene is one by one operated and perturbed all cell state, so the intermidate should only include not include all genes , that is to say should not include all rows), is this a mistake or something? or is it a total different logic?

Furthermore, in 'cell_and_gene', the intermidate file like ,for example , for the batch0, '……cell_embs_0batch-1' is useful, but '……cell_embs_0batch57' is almost blank and have no contribution to the result. That works for all the batch, all the batch with'……xxbatch-1'is useful for generate result. (Of course, I ignore all the 'gene_embs_dict')

But in 'cell', seems all the intermidate file, Whatever file suffix it is, they all have considerable content and have a contribution, and they are generated more quickily than above.

Pictures show different result between 'cell_and_gene' and 'cell':
①‘cell_and_gene’, I got aboult 4532 row genes and this cost me alomst 5 days(I think the time cost on generating 'gene_embs_dict', which have no use and cost much time, that is why I turned to 'cell' mode)
image.png
②‘cell’,which is so strange that I only run for less than one day and I got so much genes perturbed?

image.png

Look forward to your reply !

Update:
I have obtained a further result with time went by. AND I think I probably understand 'cell' mode, semms it perturb genes not one by one, but perturb cells one by one, which I can tell from the increasing number of N_detections of all genes.

However, I nooticed another problem, as I wrote above, I have a longer running result of 'cell_and_gene' mode(use 95M-pretrained model), and I found N_detections in it semms not correct.

For example, for the gene 'ARID5A' which I previously perturbate single gene perturbation, and I can tell from result from that time, its N_detections are 3452, which align with my raw data:

image.png

However, in this 95M pretrained model with 'cls_and_gene' cell state perturbation N_detections are only 2:

image.png

And in my recent 'cell' mode (actually with 30M finetuned model), the 'ARID5A'''s N_detections are 629, which I have confidence in because of the increasing N_detections number as each iteration goes:

image.png

Thank you for your questions. Please refer to the documentation of the InSiIicoPerturber, which contains answers to your questions above: https://geneformer.readthedocs.io/en/latest/geneformer.in_silico_perturber.html

I will write the relevant information from that documentation here to answer your questions:

  • First of all, there are two different ways to calculate a cell embedding, either by mean pooling all the gene embeddings ("cell"), or by taking the embedding of the cls token that is in the front of the cell representation ("cls"). The 30M model did not have a cls token so the only option is "cell". For the 95M model, we would recommend using "cls" to take advantage of this representation, which is in a single token and therefore less affected by the number of total genes in each cell. From here onwards I will assume you are using the new model and discuss this for the "cls" embedding case.
  • You can choose to perturb either A) all genes, with "all" mode, which perturb each gene one by one in each cell up to the maximum number of cells, or B) a set of genes by providing a list of genes. All genes in the list are perturbed together. This is therefore useful if you want to perturb a specific combination of genes, like TBX5 and GATA4, in which case you can add both of them to the list and they will both be perturbed. However, if you would like to perturb just a single gene, you can just provide one gene in the list, like only TBX5. If you want to perturb TBX5 and GATA4, but not together, just each individually, you simply set up the perturbation two times, each with just the list containing TBX5, or the list only containing GATA4. So, to answer your question about whether you can perturb individually a given number of genes, yes you can do this simply by setting up an array of jobs, which is also useful to parallelize them, each with a given gene. Please note, not all genes will contain both of the genes, so if you want to perturb them with the same exact cells, you should filter your dataset before hand only to include the cells that have both genes, but then run the perturbation with each gene individually. If you want to perturb both of them together by providing them as a combination both in the list of genes to perturb, the dataset will be automatically filtered for those cells that contain both of the genes in the same cell, so that they can both be perturbed in combination.
  • Once you set up which genes you would like to perturb, either "all" individually, or a single gene, or a set of genes in combination (there are also modes for combinations so please refer to the documentation for further information about that), the mode determines what result you want to measure as a result of that perturbation. If you select "cls", the results will contain the effect of that perturbation on the given cell's state, by measuring the cosine similarity of an original cell vs that cell with the given perturbation performed. If you set it up to measure movement towards a goal state, the measurement will be the proportion of the movement from the original cell towards the direction of that specific goal state, as opposed to just how far from the original cell it moved without accounting for the direction of movement. Now, the "cls_and_gene" method measures the result of a given perturbation not just on the cell's state, but also the effect on all the individual genes within that cell. This useful to investigate connection between genes within a gene network.
  • Regarding the time of the perturbation: when you set up the perturbation to perturb "all" genes, the model will run inference on the original cell, as well as a simulated perturbed cell for each of the genes. So, if you have 10,000 cells that each have 4096 genes detected, you are running inference on 40,970,000 cells (10,000 original cells + 10,0004096 perturbed cells). If you are only interested in a 2 individual genes, then you would cut the time down because you would only run inference on 30,000 cells (10,000 original cells + 10,0002 perturbed cells).
  • Regarding the space of the perturbation: when you set up the perturbation with the mode "cls", the results will be the "cls" cosine shift for the 40,960,000 simulated cells. So, 40,960,000 numbers. If you set it up to be "cls_and_gene", for each perturbation, you will get not just a "cls" cosine shift but also the cosine shift of each gene in response to that perturbation, so you will get 40,960,000+40,960,000*4095 = ~168 billion numbers. This will also increase the time because instead of comparing just the cosine shift for a single token (original cls compared to perturbed cls), you will calculate the cosine shift for each individual gene in the cell. Therefore, if you don't need the gene-level analysis for your scientific question, it would save time and space to use the "cls" mode instead.
  • Regarding the batches in the intermediate output files: these batches are set up solely to ensure intermediate saving checkpoints and memory clearing to avoid memory constraints. The batches will not be equal because when you reach the end of a given cell, there may not be enough genes to fill the whole batch, depending on how you have set up the perturbation and your batches. You should ignore the numbers in the batches and just provide them to the InSilicoPerturberStats as the outer directory.
  • Regarding the differences in number of detections: if you set a maximum number of 1000 cells with "all", the job will only perturb those 1000 cells, chosen randomly, that may or may not contain a particular gene. If you set it up with a maximum number of 1000 cells and the gene TBX5, it will first filter for cells that contain TBX5, then subsample to 1000, so you will be guaranteed to measure TBX5 perturbations in 1000 cells. If you set it up as "cls_and_gene" for a specific gene like TBX5, the results will contain both the TBX5 effect on the cell state (labeled "cell") as well as the impact of TBX5 on each individual gene, where it will contain the gene name. Because to do this measurement you have to have a cell that expressed both TBX5 and the given gene, this will be different for each gene. For example, if TBX5 appears in 1000 cells, but it is only co-detected with MRF-1 (from your example) in 2 cells, the "cell" row will have N_Detections of 1000, but the MRF-1 row will have N_Detections of 2.
  • Just to note, the way the jobs are set up, if you want to perturb all genes or nearly all genes, it is more efficient to use the "all" option as opposed to running individual jobs for each gene. This is because in the "all" option, you run inference on the original cell only once, as opposed to for each individual gene job.

Hopefully that answers some of your questions. Please refer to the documentation for InSilicoPerturber and InSilicoPerturberStats for more information.

ctheodoris changed discussion status to closed

Thank you for your questions. Please refer to the documentation of the InSiIicoPerturber, which contains answers to your questions above: https://geneformer.readthedocs.io/en/latest/geneformer.in_silico_perturber.html

I will write the relevant information from that documentation here to answer your questions:

  • First of all, there are two different ways to calculate a cell embedding, either by mean pooling all the gene embeddings ("cell"), or by taking the embedding of the cls token that is in the front of the cell representation ("cls"). The 30M model did not have a cls token so the only option is "cell". For the 95M model, we would recommend using "cls" to take advantage of this representation, which is in a single token and therefore less affected by the number of total genes in each cell. From here onwards I will assume you are using the new model and discuss this for the "cls" embedding case.
  • You can choose to perturb either A) all genes, with "all" mode, which perturb each gene one by one in each cell up to the maximum number of cells, or B) a set of genes by providing a list of genes. All genes in the list are perturbed together. This is therefore useful if you want to perturb a specific combination of genes, like TBX5 and GATA4, in which case you can add both of them to the list and they will both be perturbed. However, if you would like to perturb just a single gene, you can just provide one gene in the list, like only TBX5. If you want to perturb TBX5 and GATA4, but not together, just each individually, you simply set up the perturbation two times, each with just the list containing TBX5, or the list only containing GATA4. So, to answer your question about whether you can perturb individually a given number of genes, yes you can do this simply by setting up an array of jobs, which is also useful to parallelize them, each with a given gene. Please note, not all genes will contain both of the genes, so if you want to perturb them with the same exact cells, you should filter your dataset before hand only to include the cells that have both genes, but then run the perturbation with each gene individually. If you want to perturb both of them together by providing them as a combination both in the list of genes to perturb, the dataset will be automatically filtered for those cells that contain both of the genes in the same cell, so that they can both be perturbed in combination.
  • Once you set up which genes you would like to perturb, either "all" individually, or a single gene, or a set of genes in combination (there are also modes for combinations so please refer to the documentation for further information about that), the mode determines what result you want to measure as a result of that perturbation. If you select "cls", the results will contain the effect of that perturbation on the given cell's state, by measuring the cosine similarity of an original cell vs that cell with the given perturbation performed. If you set it up to measure movement towards a goal state, the measurement will be the proportion of the movement from the original cell towards the direction of that specific goal state, as opposed to just how far from the original cell it moved without accounting for the direction of movement. Now, the "cls_and_gene" method measures the result of a given perturbation not just on the cell's state, but also the effect on all the individual genes within that cell. This useful to investigate connection between genes within a gene network.
  • Regarding the time of the perturbation: when you set up the perturbation to perturb "all" genes, the model will run inference on the original cell, as well as a simulated perturbed cell for each of the genes. So, if you have 10,000 cells that each have 4096 genes detected, you are running inference on 40,970,000 cells (10,000 original cells + 10,0004096 perturbed cells). If you are only interested in a 2 individual genes, then you would cut the time down because you would only run inference on 30,000 cells (10,000 original cells + 10,0002 perturbed cells).
  • Regarding the space of the perturbation: when you set up the perturbation with the mode "cls", the results will be the "cls" cosine shift for the 40,960,000 simulated cells. So, 40,960,000 numbers. If you set it up to be "cls_and_gene", for each perturbation, you will get not just a "cls" cosine shift but also the cosine shift of each gene in response to that perturbation, so you will get 40,960,000+40,960,000*4095 = ~168 billion numbers. This will also increase the time because instead of comparing just the cosine shift for a single token (original cls compared to perturbed cls), you will calculate the cosine shift for each individual gene in the cell. Therefore, if you don't need the gene-level analysis for your scientific question, it would save time and space to use the "cls" mode instead.
  • Regarding the batches in the intermediate output files: these batches are set up solely to ensure intermediate saving checkpoints and memory clearing to avoid memory constraints. The batches will not be equal because when you reach the end of a given cell, there may not be enough genes to fill the whole batch, depending on how you have set up the perturbation and your batches. You should ignore the numbers in the batches and just provide them to the InSilicoPerturberStats as the outer directory.
  • Regarding the differences in number of detections: if you set a maximum number of 1000 cells with "all", the job will only perturb those 1000 cells, chosen randomly, that may or may not contain a particular gene. If you set it up with a maximum number of 1000 cells and the gene TBX5, it will first filter for cells that contain TBX5, then subsample to 1000, so you will be guaranteed to measure TBX5 perturbations in 1000 cells. If you set it up as "cls_and_gene" for a specific gene like TBX5, the results will contain both the TBX5 effect on the cell state (labeled "cell") as well as the impact of TBX5 on each individual gene, where it will contain the gene name. Because to do this measurement you have to have a cell that expressed both TBX5 and the given gene, this will be different for each gene. For example, if TBX5 appears in 1000 cells, but it is only co-detected with MRF-1 (from your example) in 2 cells, the "cell" row will have N_Detections of 1000, but the MRF-1 row will have N_Detections of 2.
  • Just to note, the way the jobs are set up, if you want to perturb all genes or nearly all genes, it is more efficient to use the "all" option as opposed to running individual jobs for each gene. This is because in the "all" option, you run inference on the original cell only once, as opposed to for each individual gene job.

Hopefully that answers some of your questions. Please refer to the documentation for InSilicoPerturber and InSilicoPerturberStats for more information.

Thank you so much for your patient and detailed reply ! But I want to make my problems clear concerning the 'differences in number of detections'.

image.png

You can tell from my codes that I use 'all' mode and set max_n_cells =None, with no anchor gene, and use 95M pretrained models with 'cls_and_gene'(at first is 'cls_and_gene', and csv result showed above come from that; the codes in above picture change it to 'cls' later), so I believe , for 'ARID5A', it should align with its actual cell num 3452(this num came from single perturbation result and align with true cell num ), but it only got 2 in this cell_state perturbation, which really bothered me a lot.

Thank you for following up. If you are referring to this screenshot below, this row is for a gene other than ARID5A. It says MRF-1. Please re-read my prior message to understand.
image.png

Thank you for following up. If you are referring to this screenshot below, this row is for a gene other than ARID5A. It says MRF-1. Please re-read my prior message to understand. I used MRF-1 in my example to help make it clear for you.

image.png

Oh, Sorry to mislead you, I think it is beacuse of your differnent dict, when using 30M model and dict , is 'MRF-1'; when using 95M model and dict, is 'ARID5A'. You can see their ensembl_id is identical

image.png

I have an interesting finding, seems that 95model+cls is differnent from 30Mmodel+cell in cell_state perturbation:
30M:just a moment , a lot of files
image.png

95M: same time, just two file:

image.png

I guess, in 30M, perturbation iterate all the genes one cell by one cell; but in 95, perturbation iterate all cells one gene by one gene?

Thank you for clarifying this - the Ensembl IDs were cut off in your screen shot. You mention you were doing many things differently: different modes, different models, different gene set up, etc. All of these things can affect the number of cells and files. One thing I notice though is that your files in the second screenshot are labeled until 57 but I don’t see 0-56. Similarly however, there are some numbers missing in the other one. It would be worth rerunning for a smaller subset of cells to confirm in case somehow the saving was affected, and make sure you run everything exactly the same except for the change you are trying to test. You can also check the number of cells containing the gene of interest in each tokenized dataset, since they have different lengths and gene median dictionaries and will encode the transcriptomes differently. You should make sure that the correct dictionary is used for each model, for both the tokenization and the perturbation. Please note for tokenization there are 3 different dictionaries that are applied, the token, gene median, and mapping dictionary, that all need to harmonize for the model you are testing.

Sign up or log in to comment