Vision Encoder does not scale well on batched images as input

#38
by Gear12312 - opened

Hi, awesome model! While using it, I noticed that the batched image encoding (following the batch_answer() function) doesn't seem to speed up with batch size.

Here is the dataloader:

    import os

    from PIL import Image
    from torch.utils.data import Dataset, DataLoader

    class ImageFolderDataset(Dataset):
        def __init__(self, folder_path):
            self.folder_path = folder_path
            self.image_files = [f for f in os.listdir(folder_path) if f.lower().endswith(('.png', '.jpg', '.jpeg'))]

        def __len__(self):
            return len(self.image_files)

        def __getitem__(self, idx):
            image_path = os.path.join(self.folder_path, self.image_files[idx])
            image = Image.open(image_path).convert('RGB')
            return image

    def collate_fn(batch):
        # Identity collate: keep the PIL images as a plain list.
        return batch

    dataset = ImageFolderDataset(folder_path)
    dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=True, num_workers=1, collate_fn=collate_fn, drop_last=True)

I don't see any batched speedup when I run this with different batch sizes:

    for i, batch in enumerate(dataloader):
        model.encode_image(batch)

With a batch size of 1 it's around 1 s per image, but with a batch size of 10 it's around 10 s per 10 images.

Am I doing something wrong here?
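In case the timing methodology matters: this is roughly how I'm measuring throughput. It's a pure-stdlib sketch with a fake encoder standing in for model.encode_image (the helper names here are mine, not from the model's API); on a real GPU you'd also call torch.cuda.synchronize() before reading the clock, since kernels launch asynchronously.

```python
import time

def images_per_second(encode_fn, batches):
    # Wall-clock throughput across all batches.
    total = 0
    start = time.perf_counter()
    for batch in batches:
        encode_fn(batch)  # on GPU, follow this with torch.cuda.synchronize()
        total += len(batch)
    return total / (time.perf_counter() - start)

# Fake encoder whose cost is linear in batch size -- this mimics the
# behaviour I'm seeing (no batched speedup at all).
def fake_encode(batch):
    time.sleep(0.01 * len(batch))

rate_b1 = images_per_second(fake_encode, [[None]] * 10)   # 10 batches of 1
rate_b10 = images_per_second(fake_encode, [[None] * 10])  # 1 batch of 10
print(f"batch=1: {rate_b1:.1f} img/s, batch=10: {rate_b10:.1f} img/s")
```

If the encoder really batched the work, rate_b10 should come out clearly higher than rate_b1; with the numbers I'm observing, the two are essentially identical.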
