---
license: apache-2.0
language:
- ko
- en
base_model:
- openchat/openchat_3.5
pipeline_tag: text-generation
datasets:
- AIDX-ktds/ko_leaderboard
---
### ⛱ This model was developed with openchat3.5 as its foundation model so that it can be applied to the Korean language and to Korea's diverse culture.
### It was trained on self-built Korean data covering 53 domains and is designed to understand the values and culture of Korean society.
# ❶ Model Description
- Model name and key features:
This model is based on OpenChat 3.5 and fine-tuned with SFT, i.e. a Mistral 7B / openchat3.5-based model.
It is designed to understand Korean and the diverse cultural contexts of Korea, and it draws on self-built Korean data from 53 domains to reflect the values and culture of Korean society.
Its main capabilities include text generation, conversational inference, document summarization, question answering, sentiment analysis, and a variety of other natural-language-processing tasks,
and it can be applied in fields such as law, finance, science, education, business, and cultural research.
- Model architecture:
This model is a high-performance language model with 7 billion parameters (7B) based on the Mistral 7B architecture.
It uses OpenChat 3.5 as its foundation model and was trained with supervised fine-tuning (SFT) to specialize in the Korean language and Korean culture.
The lightweight structure of Mistral 7B provides fast inference and memory efficiency and is optimized for a wide range of natural-language-processing tasks.
This architecture performs well on tasks such as text generation, question answering, document summarization, and sentiment analysis.
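The architecture described above can be checked against the published configuration without downloading the full weights. The snippet below is a minimal sketch; the repository id is taken from the usage section of this card, and the expected values in the comments assume the standard Mistral 7B configuration.
<pre><code>
from transformers import AutoConfig

# Load only the configuration to inspect the underlying architecture.
config = AutoConfig.from_pretrained("SEOKDONG/openchat3.5_korean_v1.0_sft")

print(config.model_type)           # expected: "mistral"
print(config.num_hidden_layers)    # 32 for standard Mistral 7B
print(config.hidden_size)          # 4096 for standard Mistral 7B
print(config.num_attention_heads)  # 32 for standard Mistral 7B
</code></pre>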
# ❷ Training Data
- This model was trained on a total of 3.6GB of self-developed data, comprising 2.33 million examples of Q&A, summarization, classification, and similar tasks.
Of these, 1.33 million are multiple-choice questions spanning 53 domains, including Korean history, social studies, finance, law, taxation, mathematics, biology, physics, and chemistry,
and were trained in a Chain of Thought format. An additional 1.3 million open-ended (short-answer) questions cover 38 domains such as Korean history, finance, law, taxation, and mathematics.
The training data also includes examples intended to teach the model Korean social values and human emotions and to generate output that follows the given instructions.
- Training Instruction Datasets Format:
<pre><code>{"prompt": "prompt text", "completion": "ideal generated text"}</code></pre>
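As a rough illustration, a record in this format can be concatenated into a single training string. The sketch below assumes a plain " 답변:" ("Answer:") separator, borrowed from the inference example in the usage section; the actual template used during training may differ.
<pre><code>
import json

# One record in the instruction format shown above (illustrative content only).
record = json.loads('{"prompt": "prompt text", "completion": "ideal generated text"}')

# Concatenate prompt and completion into a single SFT training string.
# The " 답변:" ("Answer:") separator is an assumption based on the usage example below.
train_text = record["prompt"] + " 답변: " + record["completion"]
print(train_text)
</code></pre>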
# ❸ Use Cases
This model can be used in a variety of application areas, for example:
- Education: question answering and explanation generation for study material in history, mathematics, science, and other subjects.
- Business: answering legal, financial, and tax-related queries and summarizing documents.
- Research and culture: natural-language-processing tasks tailored to Korean society and culture, sentiment analysis, document generation, and translation.
- Customer service: generating conversations with users and providing personalized responses.
- The model offers a strong understanding of Korean society and high applicability across a wide range of NLP tasks.
- Examples
1. Which country does Dokdo belong to?
Dokdo is the territory of the Republic of Korea. It is also known by its Korean name, Dokdo, which is the name of the island itself.
2. What is the sea to the east of Korea called?
The sea to the east of Korea is called the 'East Sea'. It lies off Korea's east coast and shares a boundary with North Korea.
The East Sea is one of Korea's important maritime routes and plays a significant role in shipping and fisheries.
3. What do you think about Japan's colonization of Joseon in the early 20th century?
Joseon came under Japanese colonization in the mid-19th century. This came about through Japan's military and economic power and its political encroachment.
Joseon yielded to Japan's military intrusion and the colonial relationship began. Under these circumstances the people of Joseon would have felt great anxiety and frustration.
In the later colonial period, however, as the Japanese colonial system gradually weakened, the people of Joseon came to pursue greater freedom and independence.
Against this historical background, the people of Joseon would have held a deeply negative attitude toward colonization.
4. What do you think about An Jung-geun's shooting of Ito Hirobumi?
An Jung-geun shot Ito Hirobumi on April 27, 1909. He opposed Japanese imperialist politics and military action and attacked Ito to stop Japan's rule by force.
An Jung-geun is known in Korea as an independence activist, and his act is regarded as one of the major events of the Korean independence movement.
He was recognized in 1946 as a foremost independence activist, and his example has given inspiration to many other independence activists.
5. How important do you consider community spirit and filial piety in Korean society?
They are very important. Korea traditionally has a strong sense of community, and the culture of valuing filial devotion within the family and the local community runs deep.
These values still play an important role in society today and are particularly helpful with social issues such as care for the elderly.
They also promote individual happiness and a sense of stability. Maintaining and developing these values is therefore an important goal for Korean society.
# ❹ Limitations
- This model is specialized for the Korean language and Korean culture.
However, due to a lack of data in certain areas (e.g., the latest international material or highly specialized fields),
the accuracy of responses concerning other languages or cultures may be lower.
It may also show limited reasoning ability on problems that require complex logical thinking,
and if biased data is included in training, biased responses may be generated.
# ❺ How to Use
<pre><code>
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("SEOKDONG/openchat3.5_korean_v1.0_sft")
model = AutoModelForCausalLM.from_pretrained("SEOKDONG/openchat3.5_korean_v1.0_sft")

# Korean legal question; the trailing " 답변:" ("Answer:") cues the model to respond.
input_text = """ 「국민건강보험법」제44조, 「국민건강보험법 시행령」제19조,「약관의 규제에 관한 법률」제5조, 「상법」제54조 참조 판단 해줘""" + " 답변:"
inputs = tokenizer(input_text, return_tensors="pt")
with torch.no_grad():
    outputs = model.generate(**inputs, max_length=1024, temperature=0.5, do_sample=True, repetition_penalty=1.15)
result = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(result)
</code></pre>
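The model can also be run through the Transformers text-generation pipeline. The sketch below reuses example question 2 from this card; the half-precision and device-placement settings are assumptions about the runtime environment, and the sampling parameters mirror the snippet above.
<pre><code>
import torch
from transformers import pipeline

# Text-generation pipeline; dtype and device settings are assumptions for a single-GPU setup.
generator = pipeline(
    "text-generation",
    model="SEOKDONG/openchat3.5_korean_v1.0_sft",
    torch_dtype=torch.float16,
    device_map="auto",
)

# Example question 2 from this card; " 답변:" ("Answer:") cues the model to answer.
prompt = "한국 동쪽에 있는 바다를 무엇이라고 하는가?" + " 답변:"
output = generator(prompt, max_new_tokens=256, do_sample=True,
                   temperature=0.5, repetition_penalty=1.15)
print(output[0]["generated_text"])
</code></pre>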