Can you share the notebook pls
Is it possible to share the notebook for fine-tuning ColPali?
Thanks a lot, I also just saw it. I'll revise the data loader for my dataset. Thanks!
SG @smjain. Excited to see what you train!
Sure, will keep you posted as soon as I've had a stab at it.
I would also be very interested in how you changed the data loading. The way I understood the notebook, the trainer plucks the query and answer from the dataset for training, right? How did you handle your case, where you have multiple queries and explanations in one row?
@wirtsi for now, I only trained on one of the columns but I think it could be worth trying:
- randomly choosing one query type (a quick sketch of this is below, after the unpivot example)
- training on multiple queries per image
I am lazy, so I would just convert the dataset up front to keep the training loop the same. For the UFO dataset, something like this should work:
from datasets import Dataset, Image
import polars as pl

# Lazily scan the parquet file straight from the Hugging Face Hub
df = pl.scan_parquet('hf://datasets/davanstrien/ufo-ColPali/data/train-00000-of-00001.parquet')

# Keep only rows whose raw queries were successfully parsed into JSON,
# then drop the raw text since we only need the parsed query columns
df = df.filter(pl.col("parsed_into_json") == True)
df = df.drop('raw_queries')

# Unpivot so each (image, query) pair becomes its own row,
# i.e. every image appears once per query type
unpivoted_df = df.unpivot(
    index=['image'],
    on=['broad_topical_query', 'specific_detail_query', 'visual_element_query'],
    variable_name='query_type',
    value_name='query',
)

# Drop the query_type column (no longer needed) and collect the results
result_df = unpivoted_df.drop('query_type').collect()

# Convert to a datasets.Dataset and cast the image column to an Image feature
ds = Dataset.from_polars(result_df)
ds = ds.cast_column("image", Image())
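For the first option above (randomly choosing one query type per image instead of unpivoting), a minimal sketch along the same lines could look like this. This isn't from the notebook, just an assumption on my part reusing the same UFO query column names; note the random pick is made once up front, so you'd need to redo it per epoch if you want a fresh query each time.

import random
import polars as pl
from datasets import Dataset, Image

QUERY_COLS = ['broad_topical_query', 'specific_detail_query', 'visual_element_query']

# Same scan and filter as above
lf = pl.scan_parquet('hf://datasets/davanstrien/ufo-ColPali/data/train-00000-of-00001.parquet')
lf = lf.filter(pl.col("parsed_into_json") == True)
picked_df = lf.select(['image'] + QUERY_COLS).collect()

# Pull just the three query columns into plain Python rows (no image bytes involved)
# and keep one randomly chosen query per image
query_rows = picked_df.select(QUERY_COLS).rows()
queries = [random.choice(row) for row in query_rows]

# One row per image, with a single 'query' column like the unpivoted version
picked_df = picked_df.select('image').with_columns(pl.Series('query', queries))

ds = Dataset.from_polars(picked_df)
ds = ds.cast_column("image", Image())

Either way you end up with the same image/query columns, so the training loop from the notebook shouldn't need to change.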
Fantastic, thank you so much!
You can try this if it helps: https://colab.research.google.com/drive/1eYNzzjj9gwuLkjQzyme6If7aHFF0md3j?usp=sharing