Download datasets:
- Download and decompress the TSV file from here: https://github.com/google-research-datasets/wit/blob/main/DATA.md
- Use `prepare_wit.py` to download images from Wikipedia.
- Use `discard_incorrect_files` to filter out corrupt files.
  - TODO: Some corrupt files are still being kept.
  - TODO: Make it a CLI.
- Finally, use `run-clip.sh` to train.
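
The corrupt-file filtering step and its CLI TODO could be sketched roughly as follows. This is not the implementation of `discard_incorrect_files`; the names `looks_corrupt`, `find_corrupt`, `main`, and the `--delete` flag are hypothetical, and the check itself (comparing the leading and trailing bytes of JPEG and PNG files against their expected markers) is an assumed heuristic. It catches empty and truncated downloads, but not damage in the middle of a file, and it may flag valid JPEGs that carry extra bytes after the end marker.

```python
import argparse
from pathlib import Path

# Expected leading and trailing bytes per extension.
# Illustrative assumption: only JPEG and PNG are checked.
SIGNATURES = {
    ".jpg": (b"\xff\xd8\xff", b"\xff\xd9"),
    ".jpeg": (b"\xff\xd8\xff", b"\xff\xd9"),
    ".png": (b"\x89PNG\r\n\x1a\n", b"IEND\xaeB`\x82"),
}


def looks_corrupt(path: Path) -> bool:
    """Return True if a known image type is empty, truncated, or mislabeled."""
    signature = SIGNATURES.get(path.suffix.lower())
    if signature is None:
        return False  # unknown extension: not judged by this heuristic
    head, tail = signature
    if path.stat().st_size < len(head) + len(tail):
        return True  # too small to hold both markers (covers empty files)
    with path.open("rb") as f:
        start = f.read(len(head))
        f.seek(-len(tail), 2)  # seek relative to the end of the file
        end = f.read(len(tail))
    return start != head or end != tail


def find_corrupt(image_dir: Path) -> list[Path]:
    """Recursively collect files under image_dir that fail the check."""
    return sorted(
        p for p in image_dir.rglob("*") if p.is_file() and looks_corrupt(p)
    )


def main(argv=None) -> int:
    """Hypothetical CLI entry point: list, and optionally delete, bad files."""
    parser = argparse.ArgumentParser(
        description="List image files that look corrupt (header/trailer check)."
    )
    parser.add_argument("image_dir", type=Path, help="directory of downloaded images")
    parser.add_argument("--delete", action="store_true", help="remove flagged files")
    args = parser.parse_args(argv)
    for path in find_corrupt(args.image_dir):
        print(path)
        if args.delete:
            path.unlink()
    return 0
```

Calling `main(["images/", "--delete"])` (where `images/` is a placeholder directory) prints each flagged file and removes it. Running without `--delete` first gives a dry run, which is a cheap way to see what the heuristic would discard before training.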