Clarification about the different models
Hey Kanzhi Cheng,
I am a researcher who is trying to use your models for UI automation tasks. First, I wanted to say that your models are really helpful and impressive.
Could you please help me clarify the different models found in your account?
- What is the difference between them (base/mind2web/aitw/miniwob)?
- What was each of them trained on (only pre-training? pre-training plus fine-tuning on the dataset its name refers to?)
- Are the models improvements over each other, or was each trained on different data?
- Were all of them fine-tuned on ScreenSpot?
Thank you
Hi, sorry for the slightly late reply.
- SeeClick (base) is the model obtained through our proposed GUI grounding pre-training, and it possesses general GUI localization capability. The mind2web, aitw, and miniwob models are based on SeeClick-base and fine-tuned on three different downstream tasks (web, Android, and simplified web), respectively, enabling them to perform tasks within GUI environments.
- SeeClick-base is pre-training only, while mind2web/aitw/miniwob start from SeeClick-base and are further fine-tuned on the corresponding downstream tasks.
- No model was fine-tuned on ScreenSpot, nor should it be. This is because ScreenSpot serves as an evaluation benchmark for testing zero-shot GUI grounding capabilities. Our paper reports the results of SeeClick-base on ScreenSpot.
For more details, please check our paper and GitHub repo: https://huggingface.co/papers/2401.10935.
If you have any other questions, don't hesitate to ask.
Thank you very much!
This is very helpful.
So just to clarify: the mind2web/aitw/miniwob models are versions of the base model trained on their respective training datasets?
Moreover, I wanted to ask whether it is possible to have the SeeClick model output more than one candidate?
Thanks again!
the mind2web/aitw/miniwob models are versions of the base model trained on their respective training datasets?
Yes. In fact, these models are uploaded so that one can reproduce the agent task results in the paper, or apply them directly to similar GUI agent scenarios.
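In case it helps, here is a minimal sketch of how one of these checkpoints could be loaded and queried, assuming it keeps the standard Qwen-VL-Chat interface (`trust_remote_code` loading, `from_list_format`, and `chat`). The model ID, screenshot path, and prompt wording below are placeholders, not necessarily the exact format from the paper/repo:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder model ID -- substitute the actual SeeClick checkpoint
# (base or mind2web/aitw/miniwob) from this account.
model_id = "cckevinn/SeeClick"

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, device_map="auto", trust_remote_code=True
).eval()

# Qwen-VL-style query: a GUI screenshot plus a grounding instruction.
# The exact prompt template should be taken from the SeeClick repo.
query = tokenizer.from_list_format([
    {"image": "screenshot.png"},  # hypothetical path to a GUI screenshot
    {"text": 'In this UI screenshot, where is the "Login" button? (with point)'},
])
response, _ = model.chat(tokenizer, query=query, history=None)
print(response)  # expected to be a click position, e.g. "(0.49, 0.82)"
```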
output more than one candidate?
The current model can only generate multiple candidates through decoding methods such as beam search; you can refer to the original Qwen-VL repo for that. However, I think continually fine-tuning SeeClick with a small amount of data could give it the ability to generate multiple candidates directly.
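To illustrate the beam-search route, here is a minimal self-contained sketch using the standard `transformers` `generate` API with `num_beams` and `num_return_sequences`. It assumes the checkpoint supports plain `generate` on a tokenized Qwen-VL-style query, as in the original Qwen-VL repo; the model ID, screenshot path, and prompt are again placeholders:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "cckevinn/SeeClick"  # placeholder: use the actual SeeClick checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, device_map="auto", trust_remote_code=True
).eval()

# Build a Qwen-VL-style prompt with an image and a grounding instruction.
query = tokenizer.from_list_format([
    {"image": "screenshot.png"},  # hypothetical screenshot path
    {"text": 'In this UI screenshot, where is the "Search" box? (with point)'},
])
inputs = tokenizer(query, return_tensors="pt").to(model.device)

# Beam search with several returned sequences yields multiple candidates.
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=32,
        num_beams=5,
        num_return_sequences=5,  # one candidate per returned beam
        do_sample=False,
    )

for seq in outputs:
    # Strip the prompt tokens and decode only the generated continuation.
    candidate = tokenizer.decode(
        seq[inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )
    print(candidate)
```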