---
datasets:
  - MMInstruction/VLFeedback
---

# Model Card for Silkie

Silkie is a visual language model trained by preference distillation on AI feedback annotated by GPT-4V. It is a fine-tuned version of Qwen/Qwen-VL-Chat, trained on our MMInstruction/VLFeedback dataset with direct preference optimization (DPO). Compared with the original model, Silkie achieves 6.9% and 9.5% relative improvements on the MME benchmark for perception and cognition capabilities, respectively. In addition, Silkie sets a new state-of-the-art score of 3.02 on MMHal-Bench for hallucination evaluation. Please refer to our project page for more details.
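
At a high level, DPO fine-tunes the policy model to assign higher likelihood to the response GPT-4V preferred, relative to a frozen reference model. The snippet below is a minimal, generic sketch of the DPO objective, not our actual training code; the scaling factor `beta` and the per-response log-probability inputs are illustrative assumptions.

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Generic DPO objective over summed per-response log-probabilities (illustrative)."""
    # Implicit rewards: log-ratio of the policy to the frozen reference model.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Encourage a positive margin between preferred and dispreferred responses.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```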

## Model Sources

- Paper: [Silkie: Preference Distillation for Large Visual Language Models](https://arxiv.org/abs/2312.10665)

## Uses

Silkie is intended for research purposes, particularly for alignment research in multimodal models.
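
For alignment experiments, the preference data used to train Silkie can be pulled directly from the Hugging Face Hub. This is a minimal sketch; the dataset's split and column names are not documented here and should be inspected after loading.

```python
from datasets import load_dataset

# Load the VLFeedback preference dataset used to train Silkie.
vlfeedback = load_dataset("MMInstruction/VLFeedback")
print(vlfeedback)  # inspect available splits and columns
```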

## How to Get Started

Below is a simple Python snippet to get started with the model. For installation instructions, please refer to our GitHub repository.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the tokenizer and model from the Hugging Face Hub.
# trust_remote_code is required because the model ships custom Qwen-VL modeling code.
tokenizer = AutoTokenizer.from_pretrained(
    "MMInstruction/Silkie", trust_remote_code=True
)
model = AutoModelForCausalLM.from_pretrained(
    "MMInstruction/Silkie", device_map="cuda", trust_remote_code=True
).eval()

# Build a multimodal query from an image URL and a text question.
query = tokenizer.from_list_format(
    [
        {"image": "https://farm8.staticflickr.com/137/383965780_db4815011c_o.jpg"},
        {"text": "Which wooden stool has a vase with red flower on it?"},
    ]
)

# Run a single-turn chat; `history` can be passed back in for follow-up turns.
response, history = model.chat(tokenizer, query=query, history=None)
print(response)
```
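
The returned `history` can be passed back to `model.chat` to continue the conversation about the same image; the follow-up question below is only an illustration.

```python
# Ask a follow-up question, reusing the chat history from the previous turn.
follow_up, history = model.chat(
    tokenizer, query="What color are the flowers?", history=history
)
print(follow_up)
```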

## Citation

```bibtex
@article{2023vlfeedback,
  author  = {Lei Li and Zhihui Xie and Mukai Li and Shunian Chen and Peiyi Wang and Liang Chen and Yazheng Yang and Benyou Wang and Lingpeng Kong},
  title   = {Silkie: Preference Distillation for Large Visual Language Models},
  journal = {arXiv preprint arXiv:2312.10665},
  year    = {2023}
}
```