mav23 committed on
Commit
b885bbc
•
1 Parent(s): aa81bde

Upload folder using huggingface_hub

.gitattributes CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *.zip filter=lfs diff=lfs merge=lfs -text
  *.zst filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
+ openchat3.5_korean_v1.0_sft.Q4_0.gguf filter=lfs diff=lfs merge=lfs -text
README.md ADDED
@@ -0,0 +1,153 @@
+ ---
+ license: apache-2.0
+ language:
+ - ko
+ - en
+ base_model:
+ - openchat/openchat_3.5
+ pipeline_tag: text-generation
+ datasets:
+ - AIDX-ktds/ko_leaderboard
+ ---
+
+
+ ### ⛱ This model uses OpenChat 3.5 as its foundation model and was developed
+ ### so that it can be applied to the Korean language and Korea's diverse culture;
+ ### trained on self-built Korean data covering 53 domains, it is a model that
+ ### understands Korean social values and culture. ✌
+
+
+
+ # ❶ Model Description
+ - Model name and key features:
+ This model is a Mistral 7B / OpenChat 3.5-based model, fine-tuned from OpenChat 3.5 using SFT.
+ It is designed to understand Korean and Korea's diverse cultural contexts ✨✨, and it reflects the values and culture
+ of Korean society by drawing on self-built Korean data covering 53 domains.
+ Its key capabilities include text generation, conversational inference, document summarization, question answering, sentiment analysis, and a range of other NLP tasks,
+ and it can be applied in fields such as law, finance, science, education, business, and cultural research.
+ - Model architecture: This model is a high-performance language model with 7 billion parameters (7B), built on Mistral 7B.
+ It takes OpenChat 3.5 as its foundation model and was trained via SFT (supervised fine-tuning) to specialize in the Korean language and Korean culture.
+ Mistral 7B's lightweight architecture ensures fast inference and memory efficiency and is optimized for a wide range of NLP tasks.
+ The architecture performs strongly on tasks such as text generation, question answering, document summarization, and sentiment analysis.
+
+ # ❷ Training Data
+ - This model was trained on 3.6GB of self-built data: 2.33 million examples in total, covering QnA, summarization, and classification.
+ Of these, 1.33 million are multiple-choice questions spanning 53 domains, including Korean history, social studies, finance, law, tax, mathematics, biology, physics, and chemistry,
+ trained with the Chain of Thought method. A further 1.3 million short-answer questions cover 38 domains, including Korean history, finance, law, tax, and mathematics.
+ The training data also includes examples that teach the model to understand Korean social values and human emotions and to generate output that follows the given instructions.
+ - Training Instruction Datasets Format:
+ <pre><code>{"prompt": "prompt text", "completion": "ideal generated text"}</code></pre>
+
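+ As a minimal sketch of consuming records in this format (the training set itself is not
+ published in this repository, so the file name train.jsonl and the one-JSON-object-per-line
+ layout below are assumptions):
+ <pre><code>
+ import json
+
+ # Hypothetical file name; the card only documents the per-record format.
+ with open("train.jsonl", encoding="utf-8") as f:
+     for line in f:
+         record = json.loads(line)
+         # Each record is expected to carry exactly these two keys.
+         assert set(record) == {"prompt", "completion"}
+ </code></pre>
+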
+ # ❸ Use Cases
+ This model can be used in a wide range of applications. For example:
+ - Education: question answering and explanation generation for study materials in history, mathematics, science, and more.
+ - Business: answering legal, finance, and tax questions and summarizing documents.
+ - Research and culture: NLP tasks, sentiment analysis, document generation, and translation tailored to Korean society and culture.
+ - Customer service: generating conversations with users and providing personalized responses.
+ - The model's understanding of Korean society gives it broad utility across NLP tasks.
+ - Examples
+
+ 1. Whose territory is Dokdo?
+ Dokdo is the territory of the Republic of Korea; Dokdo is also known by the Korean name Dokdo, and Dokdo is the name of the island located at Dokdo. Dokdo is the name of the island located at Dokdo.
+
+ 2. What is the sea to the east of Korea called?
+ The sea to the east of Korea is called the 'East Sea'. The East Sea is the body of water on Korea's eastern side,
+ sharing a boundary with North Korea. It is one of Korea's important sea routes and plays an important role in maritime traffic and fishing.
+
+
+ 3. What do you think about Japan's colonization of Joseon in the early 20th century?
+ Joseon was colonized by Japan in the mid-19th century. This came about through Japan's military and economic power and political encroachment.
+ Joseon surrendered to Japan's military incursion and the colonial relationship began. In this situation the people of Joseon would have felt great anxiety and despair.
+ In the later colonial period, however, as Japan's colonial system gradually weakened, the people of Joseon came to pursue greater freedom and independence.
+ Given this historical background, the people of Joseon would have held a very negative attitude toward colonization.
+
+
+ 4. What do you think of the incident in which the patriot An Jung-geun shot Ito Hirobumi?
+ An Jung-geun shot Ito Hirobumi on April 27, 1909. He opposed the politics and military activities of the Japanese Empire and attacked Ito to stop Japan's rule by force.
+ An Jung-geun is known in Korea as an independence activist, and his act is regarded as one of the major events of the Korean independence movement.
+ An Jung-geun was recognized as a foremost independence activist in 1946, and his deeds inspire many independence activists.
+
+
+ 5. How important do you consider community spirit and the value of cooperation in Korean society?
+ They are very important. Korea traditionally has a strong sense of community, with a deep-rooted culture of cooperation in families and local communities.
+ These values still play an important role in today's society and are particularly helpful with social issues such as elder care.
+ They also promote individual happiness and a sense of security. Maintaining and developing these values is therefore an important goal for Korean society.
+
+ # ❹ Limitations ⛈⛈
+ - This model is specialized for the Korean language and Korean culture,
+ but a shortage of data in certain areas (e.g., the latest international material, specialized fields)
+ can reduce the accuracy of its responses about other languages or cultures.
+ It may also show limited reasoning ability on problems that demand complex logical thinking,
+ and if biased data is included in training, biased responses may be generated.
+
+ # ❺ How to Use
+ <pre><code>
+ import torch
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+
+ tokenizer = AutoTokenizer.from_pretrained("SEOKDONG/openchat3.5_korean_v1.0_sft")
+ model = AutoModelForCausalLM.from_pretrained("SEOKDONG/openchat3.5_korean_v1.0_sft")
+
+ input_text = """ 「국민건강보험법」제44조, 「국민건강보험법 시행령」제19조,「약관의 규제에 관한 법률」제5조, 「상법」제54조 참조 판단 해줘""" + " 답변:"
+ inputs = tokenizer(input_text, return_tensors="pt")
+ with torch.no_grad():
+     outputs = model.generate(**inputs, max_length=1024, temperature=0.5, do_sample=True, repetition_penalty=1.15)
+
+ result = tokenizer.decode(outputs[0], skip_special_tokens=True)
+ print(result)
+ </code></pre>
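+
+ This commit also uploads a Q4_0 GGUF quantization, so the model can be run without
+ transformers. Below is a minimal sketch using llama-cpp-python, assuming the GGUF file from
+ this repository has been downloaded to the working directory (sampling settings mirror the
+ example above):
+ <pre><code>
+ from llama_cpp import Llama
+
+ # Load the quantized weights uploaded in this commit.
+ llm = Llama(model_path="openchat3.5_korean_v1.0_sft.Q4_0.gguf", n_ctx=4096)
+
+ output = llm("독도는 어디 땅인가? 답변:", max_tokens=256,
+              temperature=0.5, repeat_penalty=1.15)
+ print(output["choices"][0]["text"])
+ </code></pre>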
openchat3.5_korean_v1.0_sft.Q4_0.gguf ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:b15a21d2d316881a6833082967818f115cc548557beb6e6ad1f39a2a5e41c18d
+ size 4108928864
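
The hunk above is a Git LFS pointer; the actual ~4.1 GB of GGUF weights are stored in LFS. A minimal sketch for verifying a downloaded copy against the pointer's SHA-256 (the local path is assumed to match the file name in this commit):

<pre><code>
import hashlib

# Expected digest is taken from the LFS pointer above.
expected = "b15a21d2d316881a6833082967818f115cc548557beb6e6ad1f39a2a5e41c18d"
h = hashlib.sha256()
with open("openchat3.5_korean_v1.0_sft.Q4_0.gguf", "rb") as f:
    for chunk in iter(lambda: f.read(1 << 20), b""):
        h.update(chunk)
assert h.hexdigest() == expected, "checksum mismatch"
</code></pre>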