Can you share the notebook pls

#1
by smjain - opened

Is it possible to share the notebook for fine-tuning ColPali?

Sure! I used this wonderful notebook from @tonywu71. I only made some very small changes to how the dataset is loaded; otherwise everything is kept as in the original notebook :)

Thanks a lot. I just saw it as well. I will revise the data loader for my dataset. Thanks!

Sounds good, @smjain. Excited to see what you train!

Sure, I will keep you posted as soon as I have had a stab at it.

I would also be very interested in how you changed the data loading. The way I understood the notebook, the trainer plucks the query and answer from the dataset for training, right? How did you handle your case, where you have multiple queries and explanations in one row?

@wirtsi for now, I only trained on one of the columns, but I think it could be worth trying:

  • randomly choosing one query type
  • training on multiple queries per image

I am lazy so I would just convert the dataset up front to keep the training loop the same. For the UFO dataset something like this should work:

from datasets import Dataset, Image
import polars as pl

# Lazily scan the parquet file from the Hub
df = pl.scan_parquet('hf://datasets/davanstrien/ufo-ColPali/data/train-00000-of-00001.parquet')
# Keep only the rows flagged as parsed into JSON
df = df.filter(pl.col("parsed_into_json") == True)
df = df.drop('raw_queries')
# Unpivot so that each (image, query) pair gets its own row
unpivoted_df = df.unpivot(
    index=['image'],
    on=['broad_topical_query', 'specific_detail_query', 'visual_element_query'],
    variable_name='query_type',
    value_name='query',
)

# Drop the query_type column as it's no longer needed, then collect the results
result_df = unpivoted_df.drop('query_type').collect()
# Convert to a datasets.Dataset
ds = Dataset.from_polars(result_df)
# Cast the image column to an Image type
ds = ds.cast_column("image", Image())
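
For the other option mentioned above (randomly choosing one query type per image), a minimal sketch could look like the following. It uses plain Python dicts so the idea is easy to check in isolation. The helper name `pick_random_query` and the example row values are made up for illustration; only the three query column names come from the snippet above.

```python
import random

# Query columns, as named in the unpivot example above
QUERY_COLUMNS = [
    "broad_topical_query",
    "specific_detail_query",
    "visual_element_query",
]


def pick_random_query(row, rng=random):
    """Return a copy of `row` with the query columns replaced by one random 'query'.

    Hypothetical helper: empty or missing queries are skipped before sampling.
    """
    candidates = [row[col] for col in QUERY_COLUMNS if row.get(col)]
    if not candidates:
        raise ValueError("row has no usable query")
    # Keep every non-query field (e.g. the image) untouched
    out = {key: value for key, value in row.items() if key not in QUERY_COLUMNS}
    out["query"] = rng.choice(candidates)
    return out


# Illustrative row; the values are placeholders, not real dataset content
example_row = {
    "image": "page_0.png",
    "broad_topical_query": "reports of unidentified aerial phenomena",
    "specific_detail_query": "date of the sighting described in the letter",
    "visual_element_query": "hand-drawn sketch of the object",
}

sampled = pick_random_query(example_row, random.Random(0))
print(sampled["query"])
```

If this were applied once with something like `ds.map(pick_random_query, remove_columns=QUERY_COLUMNS)`, the choice would be frozen for the whole run, which is an assumption worth keeping in mind; re-sampling a query on every pass would need the choice to happen at load time instead (for example in the collate function).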

Fantastic, thank you so much 😍
