---
library_name: transformers
license: apache-2.0
language:
- en
- ko
base_model:
- meta-llama/Meta-Llama-3-8B
---
# Bllossom | [Demo](https://c537bba37aaab5fc9e.gradio.live) | [Homepage](https://www.bllossom.ai/) |
The Bllossom language model is a Korean-English bilingual language model based on the open-source LLama3. It enhances the connection of knowledge between Korean and English. It has the following features:
* **Knowledge Linking**: Linking Korean and English knowledge through additional training
* **Vocabulary Expansion**: Expansion of Korean vocabulary to enhance Korean expressiveness.
* **Instruction Tuning**: Tuning using custom-made instruction following data specialized for Korean language and Korean culture
* **Human Feedback**: DPO has been applied
* **Vision-Language Alignment**: Aligning the vision transformer with this language model
**This model devel by [MLPLab at Seoultech](http://mlp.seoultech.ac.kr), [Teddysum](http://teddysum.ai/) and [Yonsei Univ](https://sites.google.com/view/hansaemkim/hansaem-kim)**
## video
## NEWS
* [2024/04] We released Bllossom v2.0, based on llama-3
* [2023/12] We released Bllossom-Vision v1.0, based on Bllossom
* [2023/08] We released Bllossom v1.0, based on llama-2.
* [2023/07] We released Bllossom v0.7, based on polyglot-ko.
## Example code
### Install Dependencies
```bash
pip install torch transformers==4.40.0 accelerate
```
### Python code with Pipeline
```python
import transformers
import torch
model_id = "MLP-KTLim/Bllossom"
pipeline = transformers.pipeline(
"text-generation",
model=model_id,
model_kwargs={"torch_dtype": torch.bfloat16},
device_map="auto",
)
pipeline.model.eval()
PROMPT = '''당신은 유용한 AI 어시스턴트입니다. 사용자의 질의에 대해 친절하고 정확하게 답변해야 합니다.'''
instruction = "서울과학기술대학교 MLP연구실에 대해 소개해줘"
messages = [
{"role": "system", "content": f"{PROMPT}"},
{"role": "user", "content": f"{instruction}"}
]
prompt = pipeline.tokenizer.apply_chat_template(
messages,
tokenize=False,
add_generation_prompt=True
)
terminators = [
pipeline.tokenizer.eos_token_id,
pipeline.tokenizer.convert_tokens_to_ids("<|eot_id|>")
]
outputs = pipeline(
prompt,
max_new_tokens=2048,
eos_token_id=terminators,
do_sample=True,
temperature=0.6,
top_p=0.9,
)
print(outputs[0]["generated_text"][len(prompt):])
# 서울과학기술대학교 MLP연구실은 멀티모달 자연어처리 연구를 하고 있습니다. 구성원은 임경태 교수와 김민준, 김상민, 최창수, 원인호, 유한결, 임현석, 송승우, 육정훈, 신동재 학생이 있습니다.
```
### Python code with AutoModel
```python
import os
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
model_id = 'MLP-KTLim/Bllossom'
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
model_id,
torch_dtype=torch.bfloat16,
device_map="auto",
)
model.eval()
PROMPT = '''당신은 유용한 AI 어시스턴트입니다. 사용자의 질의에 대해 친절하고 정확하게 답변해야 합니다.'''
instruction = "서울과학기술대학교 MLP연구실에 대해 소개해줘"
messages = [
{"role": "system", "content": f"{PROMPT}"},
{"role": "user", "content": f"{instruction}"}
]
input_ids = tokenizer.apply_chat_template(
messages,
add_generation_prompt=True,
return_tensors="pt"
).to(model.device)
terminators = [
tokenizer.eos_token_id,
tokenizer.convert_tokens_to_ids("<|eot_id|>")
]
outputs = model.generate(
input_ids,
max_new_tokens=2048,
eos_token_id=terminators,
do_sample=True,
temperature=0.6,
top_p=0.9,
repetition_penalty = 1.1
)
print(tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True))
# 서울과학기술대학교 MLP연구실은 멀티모달 자연어처리 연구를 하고 있습니다. 구성원은 임경태 교수와 김민준, 김상민, 최창수, 원인호, 유한결, 임현석, 송승우, 육정훈, 신동재 학생이 있습니다.
```
## Citation
**Language Model**
```text
@misc{bllossom,
author = {ChangSu Choi, Yongbin Jeong, Seoyoon Park, InHo Won, HyeonSeok Lim, SangMin Kim, Yejee Kang, Chanhyuk Yoon, Jaewan Park, Yiseul Lee, HyeJin Lee, Younggyun Hahm, Hansaem Kim, KyungTae Lim},
title = {Optimizing Language Augmentation for Multilingual Large Language Models: A Case Study on Korean},
year = {2024},
journal = {LREC-COLING 2024},
paperLink = {\url{https://arxiv.org/pdf/2403.10882}},
},
}
```
**Vision-Language Model**
```text
@misc{bllossom,
author = {Dongjae Shin, Hyunseok Lim, Inho Won, Changsu Choi, Minjun Kim, Seungwoo Song, Hangyeol Yoo, Sangmin Kim, Kyungtae Lim},
title = {X-LLaVA: Optimizing Bilingual Large Vision-Language Alignment},
year = {2024},
publisher = {GitHub},
journal = {NAACL 2024 findings},
paperLink = {\url{https://arxiv.org/pdf/2403.11399}},
},
}
```
## Contact
- 임경태(KyungTae Lim), Professor at Seoultech. `ktlim@seoultech.ac.kr`
- 함영균(Younggyun Hahm), CEO of Teddysum. `hahmyg@teddysum.ai`