---
license: mit
datasets:
- conceptual_12m
- HuggingFaceM4/COCO
- visual_genome
language:
- ja
- en
pipeline_tag: image-text-to-text
---


# bilingual-gpt-neox-4b-minigpt4

![rinna-icon](./rinna.png)

# Overview
This repository provides an English-Japanese bilingual multimodal conversational model like MiniGPT-4 by combining GPT-NeoX model of 3.8 billion parameters and BLIP-2.

The model is based on [`rinna/bilingual-gpt-neox-4b`](https://huggingface.co/rinna/bilingual-gpt-neox-4b) and [BLIP-2](https://huggingface.co/docs/transformers/main/model_doc/blip-2).

* **Model architecture**

    Similar with [BLIP-2](https://huggingface.co/docs/transformers/main/model_doc/blip-2) and [Vision-CAIR/MiniGPT-4](https://huggingface.co/Vision-CAIR/MiniGPT-4), the model consists of an LLM, vision-encoder with ViT and Q-Former, and linear-layer for connecting the LLM and vision-encoder.

    [`rinna/bilingual-gpt-neox-4b`](https://huggingface.co/rinna/bilingual-gpt-neox-4b) (A 36-layer, 2816-hidden-size transformer-based language model) is used as the LLM instead of [Vicuna](https://github.com/lm-sys/FastChat), which is used in the original [Vision-CAIR/MiniGPT-4](https://huggingface.co/Vision-CAIR/MiniGPT-4).

* **Finetuning**
    
    The finetuning data is the subset of the following datasets.
    * English datasets
      * [Conceptual 12M (CC12M)](https://huggingface.co/datasets/conceptual_12m)
      * [COCO 2014](https://huggingface.co/datasets/HuggingFaceM4/COCO)
      * [Visual Genome](https://huggingface.co/datasets/visual_genome)
    * Japanese datasets
      * [STAIR-captions](https://github.com/STAIR-Lab-CIT/STAIR-captions)
      * [Japanese Visual Genome VQA dataset](https://github.com/yahoojapan/ja-vg-vqa)

    Based on the implementation of [Vision-CAIR/MiniGPT-4](https://huggingface.co/Vision-CAIR/MiniGPT-4), only "first pretraining stage" described in [MiniGPT-4 paper](https://arxiv.org/abs/2304.10592) with the above datasets was conducted, and "second-stage finetuning" proposed in the paper with an aligned image-text dataset created with ChatGPT was NOT conducted.
  
* **Model Series**

    | Variant | Link |
    | :-- | :--|
    | Bilingual 4B MiniGPT4 | https://huggingface.co/rinna/bilingual-gpt-neox-4b-minigpt4 |
    | Bilingual 4B PPO | https://huggingface.co/rinna/bilingual-gpt-neox-4b-instruction-ppo |
    | Bilingual 4B SFT | https://huggingface.co/rinna/bilingual-gpt-neox-4b-instruction-sft |
    | Bilingual 4B 8K | https://huggingface.co/rinna/bilingual-gpt-neox-4b-8k |
    | Bilingual 4B | https://huggingface.co/rinna/bilingual-gpt-neox-4b |
    | Japanese 3.6B PPO | https://huggingface.co/rinna/japanese-gpt-neox-3.6b-instruction-ppo |
    | Japanese 3.6B SFT-v2 | https://huggingface.co/rinna/japanese-gpt-neox-3.6b-instruction-sft-v2 |
    | Japanese 3.6B SFT | https://huggingface.co/rinna/japanese-gpt-neox-3.6b-instruction-sft |
    | Japanese 3.6B | https://huggingface.co/rinna/japanese-gpt-neox-3.6b |

* **Contributors**
    
    [Koh Mitsuda](https://huggingface.co/mitsu-koh), [Tianyu Zhao](https://huggingface.co/tianyuz), and [Kei Sawada](https://huggingface.co/keisawada)

---

# I/O Format
A special format has been adopted to construct inputs.
* An input prompt is formatted as a conversation between `ユーザー` and `システム`.
* Each input utterance consists of (1) its speaker (`"ユーザー"` or `"システム"`), (2) a colon (`":"`), (3) a whitespace (`" "`), and (4) utterance text (e.g. `"猫はどんな体勢をしていますか？"`).
* An utterance including an image is formatted as (1) its speaker (`"ユーザー"`), (2) a colon (`":"`), (3) a whitespace (`" "`), (4) a placeholder of the image (`"<Img><ImageHere></Img>"`), (5) another whitespace (`" "`), (6) utterance text (e.g. `"What can you see?"`).  
  * The placeholder (`<ImageHere>`) is automatically replaced with the embedding of an input image in the function `get_context_emb`.
* The input prompt should be ended with `"システム: "` to acknowledge the model to generate a response.
* All the utterances in the input prompt should be separated by a newline `\n`.

Following is an example to construct input from a conversation.
~~~python
prompt = [
    {
        "speaker": "ユーザー",
        "text": "<Img><ImageHere></Img> What can you see?"
    },
    {
        "speaker": "システム",
        "text": "a cat on a table with a laptop"
    },
    {
        "speaker": "ユーザー",
        "text": "猫はどんな体勢をしていますか？"
    },
]
prompt = [
    f"{uttr['speaker']}: {uttr['text']}"
    for uttr in prompt
]
prompt = "\n".join(prompt)
prompt = (
    prompt
    + "\n"
    + "システム: "
)
print(prompt)
"""
ユーザー: <Img><ImageHere></Img> What can you see?
システム: a cat on a table with a laptop
ユーザー: 猫はどんな体勢をしていますか？
システム: 
"""
~~~

---

# How to use the model

**1. Download dependencies**

* BLIP-2 implementation included in MiniGPT-4 is used for inference.
* `customized_mini_gpt4.py` is a script to replace LLM from LLaMA architecture to GPT-NeoX one.
* `checkpoint.pth` is a finetuned weight of the linear layer (file size: 177 MB).

```bash
git clone https://github.com/Vision-CAIR/MiniGPT-4.git
cd ./MiniGPT-4
git checkout 22d8888 # latest version as of July 31, 2023.
wget https://huggingface.co/rinna/bilingual-gpt-neox-4b-minigpt4/resolve/main/customized_mini_gpt4.py
wget https://huggingface.co/rinna/bilingual-gpt-neox-4b-minigpt4/resolve/main/checkpoint.pth
```

**2. Inference**

Please run this script in `MiniGPT-4` directory.

~~~~python
import torch
import requests
from PIL import Image
from minigpt4.processors.blip_processors import Blip2ImageEvalProcessor
from customized_mini_gpt4 import CustomizedMiniGPT4

ckpt_path = "./checkpoint.pth"

model = CustomizedMiniGPT4(gpt_neox_model="rinna/bilingual-gpt-neox-4b")
tokenizer = model.gpt_neox_tokenizer

if torch.cuda.is_available():
    model = model.to("cuda")

if ckpt_path is not None:
    print("Load BLIP2-LLM Checkpoint: {}".format(ckpt_path))
    ckpt = torch.load(ckpt_path, map_location="cpu")
    model.load_state_dict(ckpt['model'], strict=False)

vis_processor = Blip2ImageEvalProcessor()

image_url = "https://huggingface.co/rinna/bilingual-gpt-neox-4b-minigpt4/resolve/main/sample.jpg"
raw_image = Image.open(requests.get(image_url, stream=True).raw).convert('RGB')
image = vis_processor(raw_image).unsqueeze(0).to(model.device)
image_emb = model.encode_img(image)

embs = model.get_context_emb(prompt, [image_emb])

output_ids = model.gpt_neox_model.generate(
    inputs_embeds=embs,
    max_new_tokens=512,
    do_sample=True,
    temperature=1.0,
    top_p=0.85,
    pad_token_id=tokenizer.pad_token_id,
    bos_token_id=tokenizer.bos_token_id,
    eos_token_id=tokenizer.eos_token_id
)

output = tokenizer.decode(output_ids.tolist()[0], skip_special_tokens=True)
print(output)
"""横になっています。"""
~~~~

---

# How to cite
```bibtex
@misc{rinna-bilingual-gpt-neox-4b-minigpt4,
    title = {rinna/bilingual-gpt-neox-4b-minigpt4},
    author = {Mitsuda, Koh and Zhao, Tianyu and Sawada, Kei},
    url = {https://huggingface.co/rinna/bilingual-gpt-neox-4b-minigpt4}
}

@inproceedings{sawada2024release,
    title = {Release of Pre-Trained Models for the {J}apanese Language},
    author = {Sawada, Kei and Zhao, Tianyu and Shing, Makoto and Mitsui, Kentaro and Kaga, Akio and Hono, Yukiya and Wakatsuki, Toshiaki and Mitsuda, Koh},
    booktitle = {Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)},
    month = {5},
    year = {2024},
    pages = {13898--13905},
    url = {https://aclanthology.org/2024.lrec-main.1213},
    note = {\url{https://arxiv.org/abs/2404.01657}}
}
```

---

# Acknowledgement

* [Vision-CAIR/MiniGPT-4](https://huggingface.co/Vision-CAIR/MiniGPT-4)
  * [BLIP-2](https://huggingface.co/docs/transformers/main/model_doc/blip-2)
  * [Lavis](https://github.com/salesforce/LAVIS)

# Licenese
[The MIT license](https://opensource.org/licenses/MIT)