license: apache-2.0
language:
  - ko
  - en
base_model:
  - openchat/openchat_3.5
pipeline_tag: text-generation
datasets:
  - AIDX-ktds/ko_leaderboard

⛱ This model uses openchat3.5 as its foundation model and was developed so that it can be applied to the Korean language and to Korea's diverse culture. It was trained on self-built Korean data covering 53 domains so that it understands the values and culture of Korean society. ✌

❢ λͺ¨λΈ μ„€λͺ…

  • λͺ¨λΈλͺ… 및 μ£Όμš”κΈ°λŠ₯: ν•΄λ‹Ή λͺ¨λΈμ€μ€ OpenChat 3.5 λͺ¨λΈμ„ 기반으둜 SFT λ°©μ‹μœΌλ‘œ νŒŒμΈνŠœλ‹λœ Mistral 7B / openchat3.5 기반 λͺ¨λΈμž…λ‹ˆλ‹€. ν•œκ΅­μ–΄μ™€ ν•œκ΅­μ˜ λ‹€μ–‘ν•œ 문화적 λ§₯락을 μ΄ν•΄ν•˜λ„λ‘ μ„€κ³„λ˜μ—ˆμœΌλ©° ✨✨, 자체 μ œμž‘ν•œ 53개 μ˜μ—­μ˜ ν•œκ΅­μ–΄ 데이터λ₯Ό ν™œμš©ν•΄ ν•œκ΅­ μ‚¬νšŒμ˜ κ°€μΉ˜μ™€ λ¬Έν™”λ₯Ό λ°˜μ˜ν•©λ‹ˆλ‹€. μ£Όμš” κΈ°λŠ₯μœΌλ‘œλŠ” ν…μŠ€νŠΈ 생성, λŒ€ν™” μΆ”λ‘ , λ¬Έμ„œ μš”μ•½, μ§ˆμ˜μ‘λ‹΅, 감정 뢄석 및 μžμ—°μ–΄ 처리 κ΄€λ ¨ λ‹€μ–‘ν•œ μž‘μ—…μ„ μ§€μ›ν•˜λ©°, ν™œμš© λΆ„μ•ΌλŠ” 법λ₯ , 재무, κ³Όν•™, ꡐ윑, λΉ„μ¦ˆλ‹ˆμŠ€, λ¬Έν™” 연ꡬ λ“± λ‹€μ–‘ν•œ λΆ„μ•Όμ—μ„œ μ‘μš©λ  수 μžˆμŠ΅λ‹ˆλ‹€.
  • λͺ¨λΈ μ•„ν‚€ν…μ²˜:ν•΄λ‹Ή λͺ¨λΈμ€μ€ Mistral 7B λͺ¨λΈμ„ 기반으둜, νŒŒλΌλ―Έν„° μˆ˜λŠ” 70μ–΅ 개(7B)둜 κ΅¬μ„±λœ κ³ μ„±λŠ₯ μ–Έμ–΄ λͺ¨λΈμž…λ‹ˆλ‹€. 이 λͺ¨λΈμ€ OpenChat 3.5λ₯Ό νŒŒμš΄λ°μ΄μ…˜ λͺ¨λΈλ‘œ μ‚Όμ•„, SFT(지도 λ―Έμ„Έ μ‘°μ •) 방식을 톡해 ν•œκ΅­μ–΄μ™€ ν•œκ΅­ 문화에 νŠΉν™”λœ μ„±λŠ₯을 λ°œνœ˜ν•˜λ„λ‘ ν›ˆλ ¨λ˜μ—ˆμŠ΅λ‹ˆλ‹€. Mistral 7B의 κ²½λŸ‰ν™”λœ κ΅¬μ‘°λŠ” λΉ λ₯Έ μΆ”λ‘  속도와 λ©”λͺ¨λ¦¬ νš¨μœ¨μ„±μ„ 보μž₯ν•˜λ©°, λ‹€μ–‘ν•œ μžμ—°μ–΄ 처리 μž‘μ—…μ— μ ν•©ν•˜κ²Œ μ΅œμ ν™”λ˜μ–΄ μžˆμŠ΅λ‹ˆλ‹€. 이 μ•„ν‚€ν…μ²˜λŠ” ν…μŠ€νŠΈ 생성, μ§ˆμ˜μ‘λ‹΅, λ¬Έμ„œ μš”μ•½, 감정 뢄석과 같은 λ‹€μ–‘ν•œ μž‘μ—…μ—μ„œ νƒμ›”ν•œ μ„±λŠ₯을 λ³΄μ—¬μ€λ‹ˆλ‹€.

❷ Training Data

  • This model was trained on a total of 3.6GB of self-developed data. In all, the data contains 2.33 million records of QnA, summarization, classification, and similar tasks, of which 1.33 million are multiple-choice questions across 53 domains. These domains include Korean history, society, finance, law, tax, mathematics, biology, physics, and chemistry, and the data was trained with the Chain of Thought method. In addition, 1.3 million open-ended questions were trained across 38 domains, including Korean history, finance, law, tax, and mathematics. The training data also includes examples that teach the model to understand Korean social values and human emotions and to produce output that follows the given instructions.
  • Training instruction dataset format:
    {"prompt": "prompt text", "completion": "ideal generated text"}

❸ Use Cases

This model can be used in a variety of application areas. For example:

  • Education: answering questions and generating explanations for learning materials in history, mathematics, science, and other subjects.
  • Business: answering legal, financial, and tax-related queries, and summarizing documents.
  • Research and culture: natural language processing tasks tailored to Korean society and culture, sentiment analysis, document generation, and translation.
  • Customer service: generating conversations with users and providing tailored responses.
  • This model is highly useful for understanding Korean society and for a wide range of natural language processing tasks.
  • Examples
  1. λ…λ„λŠ” μ–΄λ”” 땅인가? λ…λ„λŠ” λŒ€ν•œλ―Όκ΅­μ˜ μ˜ν† λ‘œ, λ…λ„λŠ” ν•œκ΅­μ˜ λ…λ„λΌλŠ” μ΄λ¦„μœΌλ‘œλ„ 뢈리며, λ…λ„λŠ” 독도에 μœ„μΉ˜ν•œ λ…λ„λΌλŠ” μ„¬μ˜ μ΄λ¦„μž…λ‹ˆλ‹€. λ…λ„λŠ” 독도에 μœ„μΉ˜ν•œ λ…λ„λΌλŠ” μ„¬μ˜ μ΄λ¦„μž…λ‹ˆλ‹€.

  2. ν•œκ΅­ 동μͺ½μ— μžˆλŠ” λ°”λ‹€λ₯Ό 무엇이라고 ν•˜λŠ”κ°€? ν•œκ΅­ 동μͺ½μ— μžˆλŠ” λ°”λ‹€λŠ” '동해'라고 ν•©λ‹ˆλ‹€. λ™ν•΄λŠ” ν•œκ΅­μ˜ 동μͺ½μ— μœ„μΉ˜ν•œ λ°”λ‹€λ‘œ, λΆν•œκ³Ό 경계λ₯Ό λ§žλŒ€κ³  μžˆμŠ΅λ‹ˆλ‹€. λ™ν•΄λŠ” ν•œκ΅­μ˜ μ€‘μš”ν•œ 해상 κ΅ν†΅λ‘œ 쀑 ν•˜λ‚˜λ‘œ, 해상ꡐ톡과 어업에 μ€‘μš”ν•œ 역할을 ν•˜κ³  μžˆμŠ΅λ‹ˆλ‹€.

  3. 20μ„ΈκΈ° 초반 일본이 쑰선을 식민지화 ν•œ 것에 λŒ€ν•΄μ„œ μ–΄λ–‘ν•΄ μƒκ°ν•˜λŠ”μ§€? 쑰선은 19μ„ΈκΈ° μ€‘λ°˜μ— 일본의 식민지화λ₯Ό λ°›μ•˜λ‹€. μ΄λŠ” 일본의 ꡰ사적, 경제적 κ°•λ ₯μ„±κ³Ό μ •μΉ˜μ  μΉ¨μž…μœΌλ‘œ 인해 λ°œμƒν–ˆλ‹€. 쑰선은 일본의 ꡰ사적 μΉ¨μž…μ— ν•­λ³΅ν•˜κ³  식민지 관계가 μ‹œμž‘λ˜μ—ˆλ‹€. μ΄λŸ¬ν•œ μƒν™©μ—μ„œ μ‘°μ„  ꡭ민듀은 큰 λΆˆμ•ˆκ°κ³Ό μ’Œμ ˆκ°μ„ λŠκΌˆμ„ 것이닀. κ·ΈλŸ¬λ‚˜ 일제 ν›„κΈ°μ—λŠ” 일본의 식민지화 μ²΄μ œκ°€ 점차 μ•½ν™”λ˜λ©΄μ„œ μ‘°μ„  ꡭ민듀은 더 큰 μžμœ μ™€ 독립을 μΆ”κ΅¬ν•˜κ²Œ λœλ‹€. μ΄λŸ¬ν•œ 역사적 배경을 톡해 μ‘°μ„  ꡭ민듀은 식민지화에 λŒ€ν•΄ 맀우 뢀정적인 νƒœλ„λ₯Ό λ³΄μ˜€μ„ 것이닀.

  4. μ•ˆμ€‘κ·Ό μ˜μ‚¬κ°€ 이토 νžˆλ‘œλΆ€λ―Έλ₯Ό μ €κ²©ν•œ 사건을 μ–΄λ–»κ²Œ μƒκ°ν•˜λŠ”κ°€? μ•ˆμ€‘κ·Ό μ˜μ‚¬λŠ” 1909λ…„ 4μ›” 27일에 이토 νžˆλ‘œλΆ€λ―Έλ₯Ό μ €κ²©ν–ˆλ‹€. κ·ΈλŠ” 일본 제ꡭ주의 μ •μΉ˜μ™€ ꡰ사적 ν™œλ™μ— λŒ€ν•΄ λ°˜λŒ€ν•˜λ©°, 일본의 무λ ₯ 지배λ₯Ό 막기 μœ„ν•΄ 이토λ₯Ό κ³΅κ²©ν–ˆλ‹€. μ•ˆμ€‘κ·Όμ€ ν•œκ΅­ λ‚΄μ—μ„œ λ…λ¦½μš΄λ™κ°€λ‘œ μ•Œλ €μ Έ 있으며, 그의 ν–‰μœ„λŠ” ν•œκ΅­ λ‚΄ λ…λ¦½μš΄λ™μ˜ μ€‘μš”ν•œ 사건 쀑 ν•˜λ‚˜λ‘œ 여겨진닀. μ•ˆμ€‘κ·Όμ€ 1946년에 졜초의 λ…λ¦½μš΄λ™κ°€λ‘œ μΈμ •λ°›μ•˜κ³ , 그의 ν–‰λ³΄λŠ” λ§Žμ€ λ…λ¦½μš΄λ™κ°€λ“€μ—κ²Œ μ˜κ°μ„ μ€€λ‹€.

  5. ν•œκ΅­ μ‚¬νšŒμ—μ„œ 곡동체 μ˜μ‹κ³Ό ν˜‘λ™μ˜ κ°€μΉ˜λ₯Ό μ–΄λ–»κ²Œ μ€‘μš”ν•˜κ²Œ μƒκ°ν•˜μ‹­λ‹ˆκΉŒ? μ΄λŠ” 맀우 μ€‘μš”ν•©λ‹ˆλ‹€. ν•œκ΅­μ€ μ „ν†΅μ μœΌλ‘œ 곡동체 μ˜μ‹μ΄ κ°•ν•˜κ³ , κ°€μ‘±κ³Ό 지역 μ‚¬νšŒμ—μ„œ ν˜‘λ™μ„ μ€‘μ‹œν•˜λŠ” λ¬Έν™”κ°€ κΉŠμŠ΅λ‹ˆλ‹€. μ΄λŸ¬ν•œ κ°€μΉ˜λŠ” μ—¬μ „νžˆ ν˜„μž¬ μ‚¬νšŒμ—μ„œ μ€‘μš”ν•œ 역할을 ν•˜λ©°, 특히 노인 λ³΄ν˜Έμ™€ 같은 μ‚¬νšŒμ  λ¬Έμ œμ—μ„œ 큰 도움이 λ©λ‹ˆλ‹€. λ˜ν•œ, μ΄λŸ¬ν•œ κ°€μΉ˜λŠ” 개인의 행볡과 μ•ˆμ •κ°μ„ μ¦μ§„μ‹œν‚€κΈ°λ„ ν•©λ‹ˆλ‹€. λ”°λΌμ„œ μ΄λŸ¬ν•œ κ°€μΉ˜λ₯Ό μœ μ§€ν•˜κ³  λ°œμ „μ‹œν‚€λŠ” 것은 ν•œκ΅­ μ‚¬νšŒμ˜ μ€‘μš”ν•œ λͺ©ν‘œμž…λ‹ˆλ‹€.

❹ Limitations ⛈⛈

  • This model is specialized for the Korean language and Korean culture, but because of a lack of data in certain areas (for example, recent international material and specialized fields), its responses about other languages or cultures may be less accurate. It may also show limited reasoning ability on problems that require complex logical thinking, and if biased data is included, biased responses may be generated.

❺ How to Use


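  import torch
  from transformers import AutoModelForCausalLM, AutoTokenizer

  # Load the tokenizer and the model with its language-modeling head;
  # generate() needs AutoModelForCausalLM rather than the bare AutoModel.
  tokenizer = AutoTokenizer.from_pretrained("SEOKDONG/openchat3.5_korean_v1.0_sft")
  model = AutoModelForCausalLM.from_pretrained("SEOKDONG/openchat3.5_korean_v1.0_sft")

  # Korean prompt: "Please make a judgment with reference to Article 44 of the
  # National Health Insurance Act, Article 19 of its Enforcement Decree, Article 5 of
  # the Act on the Regulation of Terms and Conditions, and Article 54 of the Commercial Act."
  # The trailing " 답변:" ("Answer:") cues the model to start its answer.
  input_text = """ 「국민건강보험법」제44조, 「국민건강보험법 시행령」제19조,「약관의 규제에 관한 법률」제5조, 「상법」제54조 참조 판단 해줘""" + " 답변:"
  inputs = tokenizer(input_text, return_tensors="pt")

  # Sample a completion without tracking gradients.
  with torch.no_grad():
      outputs = model.generate(**inputs, max_length=1024, temperature=0.5, do_sample=True, repetition_penalty=1.15)

  result = tokenizer.decode(outputs[0], skip_special_tokens=True)
  print(result)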
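
For a decoder-only model, `generate()` returns the prompt tokens followed by the newly generated tokens, so the decoded `result` normally begins with the prompt itself. The helpers below are a minimal sketch of one way to build a prompt with the " 답변:" ("Answer:") cue that the usage snippet appends, and to strip the prompt from the decoded text. The names `ANSWER_CUE`, `build_prompt`, and `extract_answer` are hypothetical and are not part of the model card or of any library.

```python
# Hypothetical helpers (not part of the model card): prompt construction and
# answer extraction around the " 답변:" ("Answer:") cue.

ANSWER_CUE = " 답변:"  # Korean for "Answer:", with the leading space used in the snippet


def build_prompt(question: str) -> str:
    """Append the answer cue to a question, trimming surrounding whitespace."""
    return question.strip() + ANSWER_CUE


def extract_answer(decoded: str, prompt: str) -> str:
    """Return only the generated answer from decoded model output."""
    # Usual case: the decoded text starts with the exact prompt.
    if decoded.startswith(prompt):
        return decoded[len(prompt):].strip()
    # Fallback: decoding may alter whitespace, so split on the last cue instead.
    _, sep, tail = decoded.rpartition(ANSWER_CUE.strip())
    return tail.strip() if sep else decoded.strip()


if __name__ == "__main__":
    prompt = build_prompt("한국의 수도는 어디인가?")  # "What is the capital of Korea?"
    # Stand-in for tokenizer.decode(...) output; no model is loaded here.
    fake_decoded = prompt + " 서울입니다."
    print(extract_answer(fake_decoded, prompt))
```

In real use, `prompt` would be passed to the tokenizer and `extract_answer(result, prompt)` applied to the decoded output.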

An English summary of the model card follows:

❶ Model Description

Model Name and Key Features:
This model is built on OpenChat 3.5, which is itself based on Mistral 7B, and is fine-tuned with the SFT (supervised fine-tuning) method. It is designed to understand Korean and the cultural contexts of Korea, using self-built Korean data covering 53 domains. The model supports tasks such as text generation, conversation inference, document summarization, question answering, sentiment analysis, and other NLP tasks. Its applications span fields like law, finance, science, education, business, and cultural research.

Model Architecture:
This model is a high-performance language model with 7 billion parameters, based on Mistral 7B. It uses OpenChat 3.5 as the foundation and is fine-tuned with SFT to perform well on the Korean language and Korean culture. The lightweight Mistral 7B architecture provides fast inference and memory efficiency and suits NLP tasks such as text generation, question answering, document summarization, and sentiment analysis.
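
As a rough illustration of the memory-efficiency point, the sketch below estimates how much memory the weights alone would occupy at a few common precisions. The 7 billion parameter count comes from the description above; the function name and the choice of precisions are illustrative assumptions, and the estimate ignores activations, the KV cache, and framework overhead.

```python
# Back-of-the-envelope weight-memory estimate for a 7B-parameter model.
# Assumption: only the weights are counted (no activations, KV cache, or overhead).

PARAM_COUNT = 7_000_000_000  # "7 billion parameters" as stated in the model card

# Bytes per parameter for common storage precisions.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "int8": 1}


def weight_memory_gb(param_count: int, bytes_per_param: int) -> float:
    """Return the weight footprint in gigabytes (1 GB = 10**9 bytes)."""
    return param_count * bytes_per_param / 10**9


if __name__ == "__main__":
    for dtype, width in BYTES_PER_PARAM.items():
        print(f"{dtype}: ~{weight_memory_gb(PARAM_COUNT, width):.0f} GB")
```

Under these assumptions the weights come to roughly 28 GB in float32, 14 GB in float16, and 7 GB in int8; actual usage during inference will be higher.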


❷ Training Data

This model was trained on 3.6GB of data comprising 2.33 million records of Q&A, summarization, classification, and similar tasks. This includes 1.33 million multiple-choice questions across 53 domains such as history, finance, law, tax, and science, trained with the Chain of Thought method. Additionally, 1.3 million open-ended questions cover 38 domains including history, finance, and law.

Training Instruction Dataset Format:
{"prompt": "prompt text", "completion": "ideal generated text"}
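
The record shape above can be sketched in code as follows. This is only an illustration: the model card shows the two keys but does not say how the records are stored, so the JSON Lines layout, the sample records, and the function names `to_jsonl` and `parse_jsonl` are all assumptions.

```python
import json

# Hypothetical sample records in the {"prompt": ..., "completion": ...} shape.
records = [
    {"prompt": "Summarize: Seoul is the capital of South Korea.",
     "completion": "Seoul is South Korea's capital."},
    {"prompt": "한국의 수도는 어디인가?", "completion": "서울입니다."},
]


def to_jsonl(items) -> str:
    """Serialize records as JSON Lines: one JSON object per line."""
    # ensure_ascii=False keeps Korean text readable instead of \u escapes.
    return "\n".join(json.dumps(item, ensure_ascii=False) for item in items)


def parse_jsonl(text: str) -> list:
    """Parse JSON Lines and check that each record has exactly the two expected keys."""
    parsed = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # skip blank lines
        record = json.loads(line)
        if set(record) != {"prompt", "completion"}:
            raise ValueError(f"line {line_no}: expected keys 'prompt' and 'completion'")
        parsed.append(record)
    return parsed


if __name__ == "__main__":
    text = to_jsonl(records)
    print(text)
    print(len(parse_jsonl(text)), "records parsed")
```

Validating the keys up front is a design choice that surfaces malformed records before they reach a training pipeline.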


❸ Use Cases

This model can be used across multiple fields, such as:

  • Education: Answering questions and generating explanations for subjects like history, math, and science.
  • Business: Providing responses and summaries for legal, financial, and tax-related queries.
  • Research and Culture: Performing NLP tasks, sentiment analysis, document generation, and translation.
  • Customer Service: Generating conversations and personalized responses for users.

This model is highly useful for understanding Korean society and for a wide range of NLP tasks.


❹ Limitations

This model is specialized for the Korean language and Korean culture. Because of a lack of data in certain areas, such as recent international material and specialized fields, its responses about other languages or cultures may be less accurate. It may also show limited reasoning ability on complex logical problems, and it may produce biased responses if biased data is included.