Phi-3 Vision visual encoder

by the-future-dev - opened May 21

May 21

Was a paper published about this vision model?
Which visual encoder was used?

bdytx5

May 21

Looks like a CLIP encoder.

Microsoft org May 21

We use CLIP-L, the paper will be released later today.

May 21

What is the resolution of the image input?

Microsoft org May 21

The resolution is dynamic based on the input image aspect ratio. The max resolution is 1344x1344.
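As a rough illustration of what a dynamic resolution with a 1344x1344 cap could look like, the sketch below scales an image size down to fit within the cap while keeping its aspect ratio. The function name `fit_within_max` and the rounding are assumptions made for illustration only; the model's actual preprocessing is not described in this thread and may differ (for example, it may pad or tile the image).

```python
def fit_within_max(width: int, height: int, max_side: int = 1344) -> tuple[int, int]:
    """Scale (width, height) down so neither side exceeds max_side.

    Hypothetical helper: the aspect ratio is preserved and images that
    already fit are returned unchanged.
    """
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    # Never upscale: the scale factor is capped at 1.0.
    scale = min(1.0, max_side / width, max_side / height)
    return max(1, round(width * scale)), max(1, round(height * scale))
```

For example, `fit_within_max(2688, 1344)` halves both sides, while `fit_within_max(800, 600)` leaves the size as it is.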

paul91

May 22

We use CLIP-L, the paper will be released later today.

Where is the paper? Please share a link.

paul91

May 22

We use CLIP-L, the paper will be released later today.

Is the visual encoder frozen during training?

sayedM

May 25

We use CLIP-L, the paper will be released later today.

Are you going to release the paper and the fine-tuning code?

haohoo

May 27

'img_processor': {'image_dim_out': 1024, 'model_name': 'openai/clip-vit-large-patch14-336', 'name': 'clip_vision_model', 'num_img_tokens': 144}

Please share the paper URL here or in the model card.
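The numbers in that config line up with the rest of the thread, and a little arithmetic makes it visible. The sketch below only uses values quoted in this discussion: the patch size and crop size are read off the model name, and the 1344 cap comes from the earlier reply. How the model gets from 576 patches to 144 image tokens, and whether it splits large images into 336x336 crops, is not stated here, so treat those readings as assumptions.

```python
# Values copied from the config quoted in this thread.
img_processor = {
    "image_dim_out": 1024,
    "model_name": "openai/clip-vit-large-patch14-336",
    "name": "clip_vision_model",
    "num_img_tokens": 144,
}

PATCH_SIZE = 14   # "patch14" in the model name
CROP_SIZE = 336   # "336" in the model name
MAX_SIDE = 1344   # max resolution stated earlier in the thread

patches_per_side = CROP_SIZE // PATCH_SIZE   # 336 / 14
patches_per_crop = patches_per_side ** 2     # patches in one 336x336 input

# Assumption: the gap between patches and num_img_tokens reflects some
# token-reduction step; the thread does not say which one.
reduction = patches_per_crop // img_processor["num_img_tokens"]

# Assumption: the 1344 cap corresponds to a grid of 336x336 crops.
max_crops_per_side = MAX_SIDE // CROP_SIZE

print(patches_per_crop, reduction, max_crops_per_side)  # 576 4 4
```

So one 336x336 input yields 576 CLIP patches, which is exactly 4 times `num_img_tokens`, and 1344 is exactly 4 times 336.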

Microsoft org Jun 4

The updated Phi-3 Technical Report is available at https://arxiv.org/pdf/2404.14219

nguyenbh changed discussion status to closed Jun 4
