
Release Notes

  • This model is finetuned from larryvrh/mt5-translation-ja_zh; great appreciation to the creator.

  • Why this model was made
    I was testing various models for translating Japanese game text into Chinese. Game text has a very rigid correspondence structure: some tokens must be translated, some tokens must be kept exactly as they are, and the number of lines must stay the same. mt5's pretraining covers this kind of correspondence, so it performs comparatively well here. Since larryvrh had already done the translation pretraining, I finetuned directly on top of that model. Game lines almost never exceed 100 characters, so larryvrh's model basically meets the requirements as-is. This finetune fixes some positioning issues in the translated output and trains in some vocabulary that was needed.

  • Known limitations
    Only an mt5-large version has been made so far; it needs roughly 8 GB+ of VRAM, which is considerably more than necessary. For convenience it is set up to push everything through in one big batch, which makes full use of the GPU, but it does not look at context, and I consider that a major drawback. The dataset does not contain enough fixed translation vocabulary, so many translations come back in some other language the model knows (usually English). After some corrective effort, it will now zero-shot a soundalike transliteration at you instead (when this zero-shot behavior showed up, nobody on our translation team could keep a straight face).

A simple backend application

Not yet stably debugged, use with caution. You need to change the model name in its settings to this model's name and then launch it.
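The backend itself is not shown here. Purely as a rough illustration, a minimal HTTP wrapper around the pipeline could look like the sketch below; Flask, the /translate route, the request schema, and the port are all my assumptions, not the actual app's design.

from flask import Flask, request, jsonify
from transformers import pipeline

app = Flask(__name__)
pipe = pipeline("translation",
                model="iryneko571/mt5-translation-ja_zh-game-large",
                repetition_penalty=1.4,
                max_length=256)

@app.route("/translate", methods=["POST"])
def translate():
    # hypothetical schema: {"lines": ["...", "..."]}
    lines = request.get_json().get("lines", [])
    prompts = [f"<-ja2zh-> {line}" for line in lines]
    out = pipe(prompts)
    return jsonify([t["translation_text"] for t in out])

if __name__ == "__main__":
    app.run(port=5000)  # port is arbitrary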

Usage guide

A more precise example of using it:

from transformers import pipeline

model_name = "iryneko571/mt5-translation-ja_zh-game-large"
# alternative: pass the tokenizer explicitly
# pipe = pipeline("translation", model=model_name, tokenizer=model_name, repetition_penalty=1.4, batch_size=1, max_length=256)
pipe = pipeline(
    "translation",
    model=model_name,
    repetition_penalty=1.4,  # discourages the model from repeating itself
    batch_size=1,            # raise this to push lines through the GPU in batches
    max_length=256,
)

def translate_batch(batch, language='<-ja2zh->'):  # batch is a list of strings
    # prepend the language tag the model expects to every line
    prompts = [f'{language} {line}' for line in batch]
    translated = pipe(prompts)
    return [t['translation_text'] for t in translated]

inputs = []  # fill with the Japanese lines you want translated

print(translate_batch(inputs))
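Note that batch_size=1 here is conservative; as mentioned in the release notes, the model is meant to be fed large batches at once to make full use of the GPU, so raise batch_size as far as your VRAM allows.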

Simple web UI (for now)

I mean, nobody stops you from hooking up a Gradio front end yourself; if you ask for one in the community discussions I will make one.
Currently working on a more enterprise-style approach, which will take a while to code. The plan:

  • integrating with XUnity.AutoTranslator
    • connect to Redis to block massive request floods (and harvest data)
    • handle the different kinds of line breaks, such as \\n, \n and \r\n
  • add support for translating a whole JSON data file (see the sketch after this list)
    • also filter out the non-Japanese text
      • and hope the model keeps the code intact
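A minimal sketch of the JSON translation and line-break handling described above; the flat {key: string} file layout, the Japanese-detection regex, and the helper names are assumptions for illustration, and translate_batch is the function from the usage guide.

import json
import re

# rough Japanese check: hiragana, katakana, or CJK ideographs
JP_RE = re.compile(r'[\u3040-\u30ff\u4e00-\u9fff]')

def split_lines(text):
    # normalize \r\n and literal "\\n" sequences to real newlines before splitting;
    # a real implementation should remember the original break style and restore it
    return text.replace('\r\n', '\n').replace('\\n', '\n').split('\n')

def translate_json_file(path, out_path):
    with open(path, encoding='utf-8') as f:
        data = json.load(f)  # assumed to be a flat {key: string} mapping
    for key, text in data.items():
        if not JP_RE.search(text):
            continue  # filter out entries with no Japanese text
        lines = split_lines(text)
        translated = translate_batch(lines)  # keeps one output line per input line
        data[key] = '\n'.join(translated)
    with open(out_path, 'w', encoding='utf-8') as f:
        json.dump(data, f, ensure_ascii=False, indent=2)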

roadmap

train mt5-small and RWKV (RWKV can read context)
make a LoRA training script and UI, to get a convenient training setup running
create an algorithm that saves low-confidence translations into a db for manual correction
search the manual-translation db with sentencepiece retrieval so the model can use "previous translations", greatly improving the consistency of its word choices
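None of this is implemented yet; purely as a sketch of the planned retrieval step, one could tokenize lines with the model's sentencepiece tokenizer and rank stored corrections by token overlap. The in-memory "db" and the Jaccard scoring below are my assumptions.

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("iryneko571/mt5-translation-ja_zh-game-large")

# previously corrected translations, keyed by source line (stand-in for a real db)
memory = {
    # "日本語の原文": "人工校对后的译文",
}

def most_similar(line, top_k=3):
    # compare sentencepiece token sets with Jaccard similarity
    query = set(tok.tokenize(line))
    scored = []
    for src, tgt in memory.items():
        tokens = set(tok.tokenize(src))
        if not query or not tokens:
            continue
        jaccard = len(query & tokens) / len(query | tokens)
        scored.append((jaccard, src, tgt))
    scored.sort(key=lambda x: x[0], reverse=True)
    return scored[:top_k]  # candidates to feed back as "previous translations"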

How to find me

Discord Server:
https://discord.gg/JmjPmJjA
If you need any help, want to try the latest version on a test server, or just want to chat, come have a look (is posting this allowed here?).
