beomi/Yi-Ko-34B-Chat-Preview (feat. ChatVector)

This repo contains the result of a naive ChatVector merge using these 3 models:

ko_model_id = 'beomi/Yi-Ko-34B'
chat_model = '01-ai/Yi-34B-Chat'
base_model = '01-ai/Yi-34B'
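The ChatVector idea can be sketched as simple weight arithmetic: take the difference between the chat model and its base model, then add that difference to the Korean model. The snippet below is a minimal illustration on toy dictionaries of floats, not the script used for this repo. The function name `apply_chat_vector`, the parameter names, and the choice to leave embedding and output-head weights untouched (a plausible workaround, since the expanded vocabulary gives those tensors a different shape) are all assumptions.

```python
from typing import Dict, List, Sequence

# Toy stand-in for a real state dict of tensors (hypothetical, for illustration).
Weights = Dict[str, List[float]]


def apply_chat_vector(
    ko: Weights,
    chat: Weights,
    base: Weights,
    skip_substrings: Sequence[str] = ("embed_tokens", "lm_head"),
) -> Weights:
    """Return ko + (chat - base) per parameter, leaving skipped ones unchanged."""
    merged: Weights = {}
    for name, ko_w in ko.items():
        skipped = any(s in name for s in skip_substrings)
        missing = name not in chat or name not in base
        if skipped or missing:
            # Keep the Korean model's values as they are.
            merged[name] = list(ko_w)
            continue
        if not (len(chat[name]) == len(base[name]) == len(ko_w)):
            # Shape mismatch (e.g. different vocabulary size): also keep as is.
            merged[name] = list(ko_w)
            continue
        merged[name] = [k + (c - b) for k, c, b in zip(ko_w, chat[name], base[name])]
    return merged


# Made-up values; real checkpoints hold large tensors, not short lists.
ko_weights = {"layers.0.mlp.weight": [1.0, 2.0], "embed_tokens.weight": [0.5, 0.5, 0.5]}
chat_weights = {"layers.0.mlp.weight": [0.25, 0.5], "embed_tokens.weight": [9.0, 9.0]}
base_weights = {"layers.0.mlp.weight": [0.0, 0.0], "embed_tokens.weight": [8.0, 8.0]}

merged = apply_chat_vector(ko_weights, chat_weights, base_weights)
print(merged)
```

With real checkpoints the same loop would run over tensors loaded from the three model IDs above; which layers to skip is a design choice that this card does not specify.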

Benchmarks

Tasks	Version	Filter	n-shot	Metric	Value		Stderr
kobest_boolq	1	none	5	acc	0.9174	±	0.0074
		none	5	f1	0.9174	±	N/A
kobest_copa	1	none	5	acc	0.8620	±	0.0109
		none	5	f1	0.8618	±	N/A
kobest_hellaswag	1	none	5	acc	0.5260	±	0.0224
		none	5	f1	0.5243	±	N/A
		none	5	acc_norm	0.5920	±	0.0220
kobest_sentineg	1	none	5	acc	0.9874	±	0.0056
		none	5	f1	0.9874	±	N/A
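For readers wondering how a Stderr value relates to an accuracy, one common estimate is the standard error of a mean of 0/1 scores. The sketch below uses that formula; the evaluation-set size of 1000 is an assumption (the card does not state sample counts), chosen only to show that the formula is in the same range as the kobest_copa row.

```python
import math


def accuracy_stderr(acc: float, n: int) -> float:
    """Standard error of the mean of n 0/1 scores, using the sample (n - 1) variance."""
    return math.sqrt(acc * (1.0 - acc) / (n - 1))


# Assumption: hypothetical evaluation-set size, not stated in this card.
n_examples = 1000
stderr = accuracy_stderr(0.8620, n_examples)
print(f"{stderr:.4f}")
```

No comparable closed form applies to the f1 rows, which is consistent with their Stderr being reported as N/A.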

Samples

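messages = [
    # System prompt (English): "As a friendly chatbot, answer the user's requests as kindly and in as much detail as possible. Give all answers in Korean."
    {"role": "system", "content": "친절한 챗봇으로서 상대방의 요청에 최대한 자세하고 친절하게 답하자. 모든 대답은 한국어(Korean)으로 대답해줘."},
    # User prompt (English): "Tell me about Admiral Yi Sun-sin."
    {"role": "user", "content": "이순신 장군에 대해 설명해줘."},
]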
---
이순신 장군은 조선 시대에 활약한 유명한 임진왜란의 전략가요. 그가 활약한 시기는 16세기 말이었으며, 이때는 일본이 조선을 침략했던 임진왜란이 발발한 때입니다. 이순신 장군은 이때 일본의 해군으로 구성된 군대가 조선을 공격하자, 조선의 해군에서 활약하며 일본을 상대했습니다. 유명한 해전인 '한산도대첩'과 '명량대첩'에서 이순신 장군은 뛰어난 전략과 용기로 일본을 상대로 크게 승리하는 등 조선을 위기에서 구한 중요한 인물로 평가받고 있습니다. 그의 용기와 지략은 현재까지도 많은 사람들의 귀감으로 전해지고 있습니다<|im_end|>

messages = [
    {"role": "system", "content": "친절한 챗봇으로서 상대방의 요청에 최대한 자세하고 친절하게 답하자. 모든 대답은 한국어(Korean)으로 대답해줘."},
    # User prompt (English): "What is the difference between Dokdo and Takeshima?"
    {"role": "user", "content": "독도와 다케시마의 차이가 뭐야?"},
]
---
"독도"와 "다케시마"는 동일한 섬을 가리키는 이름이지만 서로 다른 나라의 언어로 불리고 있습니다. "독도"는 대한민국 어로 "獨島"라고 부르며, "다케시마"는 일본 어로 "竹島"라고 부릅니다. 이 섬은 일본 돗토리현과 사카이미나토시에 인접한 일본에 가까운 곳에 위치해 있지만, 국제법적으로 대한민국의 영토로 인정받고 있습니다.

대한민국은 이 섬을 독도로 칭하며 역사적으로부터Ownership를 주장하고 있습니다. 역사적 기록에 따르면 독도는 조선시대에 이미 한국의 영토로 인식되었고, 일본 제국주의 시절에 잠시 관할권에 속했다가 일본이 패전한 이후 대한민국 영토로 환원되었습니다.

일본은 이 섬을 다케시마로 칭하며 영유권을 주장하고 있습니다. 하지만, 이러한 주장은 국제법상 인정되지 않으며, 한국 정부는 독도를 대한민국의 영토로서 절대적인 권리를 가지고 있음을 분명히 하고 있습니다.

결론적으로, "독도"와 "다케시마"는 동일한 섬을 가리키지만, 서로 다른 국가의 역사적이고 정치적 상황에 따라 부르는 이름이 다르며, 국제법상 이 섬은 대한민국의 영토로 분류되고 있습니다.<|im_end|>

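messages = [
    {"role": "system", "content": "친절한 챗봇으로서 상대방의 요청에 최대한 자세하고 친절하게 답하자. 모든 대답은 한국어(Korean)으로 대답해줘."},
    # User prompt (English): "Do you know about the incident where King Sejong threw a MacBook at the Jiphyeonjeon (Hall of Worthies)?"
    {"role": "user", "content": "세종대왕의 집현전 맥북 던짐 사건에 대해서 알고 있어?"},
]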
---
세종 대왕의 집현전 맥북 던짐 사건에 대해 들어본 적이 있습니다. 이 사건은 2017년 한 방송사의 드라마에서 등장하며 널리 알려졌는데요, 드라마에서는 집현전 학자들이 새로운 문물인 맥북을 보고 세종 대왕이 이를 던진 것으로 묘사되어 논란이 되었습니다.

하지만 이 내용은 역사적 사실과 다르다는 것이 밝혀졌습니다. 집현전 학자들이 실제로 맥북과 같은 외계 문물을 목격했을 리 없고, 세종 대왕이 학자들의 노력을 폄하하거나 물건을 던지는 행동을 했을 것이라는 전제에 문제가 있었습니다. 오히려 세종 대왕은 숙명학을 설립하고 한글을 창제하는 등 학문과 문화에 대한 지원을 아까지 않았던 것으로 알려져 있습니다.

드라마 속에 등장한 사건은 허구적인 설정일뿐, 실제 역사에서 일어난 일은 아닙니다. 이러한 오류를 반영하듯, 이후의 방송에서는 해당 장면이 수정되거나 삭제되었습니다.<|im_end|>

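messages = [
    {"role": "system", "content": "친절한 챗봇으로서 상대방의 요청에 최대한 자세하고 친절하게 답하자. 모든 대답은 한국어(Korean)으로 대답해줘."},
    # User prompt (English): "I have 5 apples today. And yesterday I ate 2 apples. How many apples do I have now?"
    {"role": "user", "content": "나는 오늘 사과 5개를 갖고있어. 그리고 어제 나는 사과 2개를 먹었지. 내가 지금 가진 사과는 몇 개지?"},
]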
---
지난 번에는 사과 2개를 먹었고, 지금은 사과 5개를 갖고 있다고 했으니, 지금 가진 사과의 개수는 5개입니다. 그런데, 사과 2개를 먹어서 총 사과를 5개에서 2개 빼면, 남은 사과의 개수는 5 - 2 = 3개입니다. 그러니까, 지금 가진 사과는 3개입니다.<|im_end|>
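The `<|im_end|>` marker at the end of each sample suggests a ChatML-style prompt format. The sketch below shows how a `messages` list like the ones above could be flattened into such a prompt by hand. The `<|im_start|>` token, the exact layout, and the helper name `to_chatml` are assumptions based on the common ChatML convention; in practice the tokenizer's own chat template is the safer source of truth.

```python
from typing import Dict, List


def to_chatml(messages: List[Dict[str, str]], add_generation_prompt: bool = True) -> str:
    """Flatten chat messages into a ChatML-style prompt string (assumed layout)."""
    parts = [f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages]
    if add_generation_prompt:
        # Open an assistant turn so the model continues from here.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


# Placeholder messages for illustration only.
example_messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"},
]
prompt = to_chatml(example_messages)
print(prompt)
```

Generation would then be expected to stop when the model emits `<|im_end|>`, as seen in the samples.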

beomi/Yi-Ko-34B-Preview

Yi-Ko series models are advanced iterations of the 01-ai/Yi models, with an expanded vocabulary and further pretraining on a Korean/English corpus. Like their predecessors, Yi-Ko series models are generative text models ranging from 6 billion to 34 billion parameters. This repository hosts the 34B pretrained version, converted to the Hugging Face Transformers format. For access to the other models, consult the index provided below.

Model Details

Model Developers Junbum Lee (Beomi)

Variations The Yi-Ko series will come in two parameter sizes, 6B and 34B, with Ko (Korean only), KoEn (Korean+English), and KoEnCo (Korean+English+Code) variations.

Input Models input text only.

Output Models generate text only.

Model Architecture

Yi-Ko series models are auto-regressive language models that use an optimized transformer architecture based on Llama-2*.

*The Yi model architecture is based on Llama-2, so the models can be loaded via the LlamaForCausalLM class in Hugging Face Transformers.
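As a rough illustration of the note above, loading with the Llama class might look like the following. This is a sketch, not an official recipe: the repo ID is taken from the model list earlier in this card, the helper name `load_yi_ko` is hypothetical, and `device_map="auto"` assumes the `accelerate` package is installed and enough memory is available for a 34B model.

```python
MODEL_ID = "beomi/Yi-Ko-34B"  # repo ID as listed earlier in this card


def load_yi_ko(model_id: str = MODEL_ID):
    """Load tokenizer and model with the Llama class (downloads weights when called)."""
    # Imported lazily so that defining this helper does not require transformers.
    from transformers import AutoTokenizer, LlamaForCausalLM

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = LlamaForCausalLM.from_pretrained(
        model_id,
        torch_dtype="auto",  # use the dtype stored in the checkpoint
        device_map="auto",   # assumption: accelerate is installed
    )
    return tokenizer, model
```

Calling `load_yi_ko()` triggers the download, so it is left uncalled here.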

Model Name	Training Data	Params	Context Length	GQA	Trained Tokens	LR	Batch Size(per step)
Yi-Ko-34B	A mix of Korean + English online data	34B	4k	O	TBD(under training)	5e-5	2048

Vocab Expansion

Model Name	Vocabulary Size	Description
Original Yi-Series	64000	Sentencepiece BPE
Expanded Yi-Ko Series	78464	Sentencepiece BPE. Added Korean vocab and merges

Tokenizing "안녕하세요, 오늘은 날씨가 좋네요.ㅎㅎ"

Model	# of tokens	Tokens
Original Yi-Series	47	`['<0xEC>', '<0x95>', '<0x88>', '<0xEB>', '<0x85>', '<0x95>', '하', '<0xEC>', '<0x84>', '<0xB8>', '<0xEC>', '<0x9A>', '<0x94>', ',', '▁', '<0xEC>', '<0x98>', '<0xA4>', '<0xEB>', '<0x8A>', '<0x98>', '은', '▁', '<0xEB>', '<0x82>', '<0xA0>', '<0xEC>', '<0x94>', '<0xA8>', '가', '▁', '<0xEC>', '<0xA2>', '<0x8B>', '<0xEB>', '<0x84>', '<0xA4>', '<0xEC>', '<0x9A>', '<0x94>', '.', '<0xE3>', '<0x85>', '<0x8E>', '<0xE3>', '<0x85>', '<0x8E>']`
Expanded Yi-Ko Series	10	`['▁안녕', '하세요', ',', '▁오늘은', '▁날', '씨가', '▁좋네요', '.', 'ㅎ', 'ㅎ']`
*Same Korean vocab as the Llama-2-Ko Series
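The `<0x..>` entries in the original tokenizer's output are byte-fallback tokens: when a character has no vocabulary entry, it is emitted as its UTF-8 bytes, three per Hangul syllable. The small sketch below reproduces that byte view with the standard library only; the helper name `byte_fallback_tokens` is hypothetical and it does not call the actual tokenizer.

```python
from typing import List


def byte_fallback_tokens(text: str) -> List[str]:
    """Render each UTF-8 byte of text in the <0xNN> style used by byte fallback."""
    return [f"<0x{b:02X}>" for b in text.encode("utf-8")]


print(byte_fallback_tokens("안"))         # first syllable of the example sentence
print(len(byte_fallback_tokens("ㅎㅎ")))  # number of bytes for the two trailing jamo
```

This is why the original tokenizer needs 47 tokens for the sentence while the expanded vocabulary needs 10.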

Tokenizing "The Yi series models are large language models trained from scratch by developers at 01.AI."

Model	# of tokens	Tokens
Original Yi-Series	21	`['The', '▁Y', 'i', '▁series', '▁models', '▁are', '▁large', '▁language', '▁models', '▁trained', '▁from', '▁scratch', '▁by', '▁developers', '▁at', '▁', '0', '1', '.', 'AI', '.']`
Expanded Yi-Ko Series	21	`['▁The', '▁Y', 'i', '▁series', '▁models', '▁are', '▁large', '▁language', '▁models', '▁trained', '▁from', '▁scratch', '▁by', '▁developers', '▁at', '▁', '0', '1', '.', 'AI', '.']`
*Since the Expanded Yi-Ko Series prepends `▁` at the beginning of the text (to ensure the same tokenization for Korean sentences), it shows a negligible difference for the first token in English tokenization.

Model Benchmark

LM Eval Harness - Korean (polyglot branch)

TBD

LICENSE

CC-BY-NC-SA-4.0 under Yi license

Citation

TBD

Acknowledgement

The training is supported by the TPU Research Cloud program.