+---
+language: zh
+datasets: c2m
+inference:
+  parameters:
+    max_length: 108
+    num_return_sequences: 1
+    do_sample: True
+widget:
+- text: "晋太元中，武陵人捕鱼为业。缘溪行，忘路之远近。忽逢桃花林，夹岸数百步，中无杂树，芳草鲜美，落英缤纷。渔人甚异之，复前行，欲穷其林。林尽水源，便得一山，山有小口，仿佛若有光。便舍船，从口入。初极狭，才通人。复行数十步，豁然开朗。土地平旷，屋舍俨然，有良田、美池、桑竹之属。阡陌交通，鸡犬相闻。其中往来种作，男女衣着，悉如外人。黄发垂髫，并怡然自乐。"
+  example_title: "桃花源记"
+- text: "往者不可谏,来者犹可追。"
+  example_title: "来者犹可追"
+- text: "逝者如斯夫！不舍昼夜。"
+  example_title: "逝者如斯夫"
+---
+# Classical Chinese (文言文) to Modern Chinese (现代文)
+## Model description
+## How to use
+Call the model through a `pipeline`:
+```python
+>>> from transformers import pipeline
+>>> model_checkpoint = "supermy/c2m-mt5"
+>>> translator = pipeline("translation",
+...     model=model_checkpoint,
+...     num_return_sequences=1,
+...     max_length=52,
+...     truncation=True)
+>>> translator("往者不可谏,来者犹可追。")
+[{'translation_text': '过 去 的 事 情 不能 劝 谏 ， 未来 的 事 情 还 可以 追 回 来 。 如 果 过 去 的 事 情 不能 劝 谏 ， 那 么 ， 未来 的 事 情 还 可以 追 回 来 。 如 果 过 去 的 事 情'}]
+>>> translator("福兮祸所伏，祸兮福所倚。",do_sample=True)
+[{'translation_text': '幸 福 是 祸 患 所 隐 藏 的 ， 灾 祸 是 福 祸 所 依 托 的 。 这 些 都 是 幸 福 所 依 托 的 。 这 些 都 是 幸 福 所 带 来 的 。 幸 福 啊 ， 也 是 幸 福'}]
+>>> translator("成事不说，遂事不谏，既往不咎。", num_return_sequences=1,do_sample=True)
+[{'translation_text': '事 情 不 高 兴 ， 事 情 不 劝 谏 ， 过 去 的 事 就 不 会 责 怪 。 事 情 没 有 多 久 了 ， 事 情 没 有 多 久 ， 事 情 没 有 多 久 了 ， 事 情 没 有 多'}]
+>>> translator("逝者如斯夫！不舍昼夜。",num_return_sequences=1,max_length=30)
+[{'translation_text': '逝 去 的 人 就 像 这 样 啊 ， 不分 昼夜 地 去 追 赶 它 们 。 这 样 的 人 就 不 会 忘 记'}]
+```
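
The decoded outputs above contain a space between most characters. A minimal post-processing sketch follows, assuming the target text is plain Chinese in which whitespace carries no meaning. `join_characters` is a hypothetical helper name introduced here for illustration, not part of `transformers`.

```python
import re


def join_characters(translation_text: str) -> str:
    """Remove all whitespace from a decoded translation (assumes pure Chinese output)."""
    return re.sub(r"\s+", "", translation_text)


# Prefix of one of the sample outputs shown above.
raw = "逝 去 的 人 就 像 这 样 啊 ， 不分 昼夜 地 去 追 赶 它 们 。"
print(join_characters(raw))  # 逝去的人就像这样啊，不分昼夜地去追赶它们。
```

The helper can be applied to `result[0]["translation_text"]`. If an output mixes in Latin-script words, removing every space would also merge those words, so a stricter rule would be needed in that case.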
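+Here is how to load the model and tokenizer directly and generate a translation in PyTorch:
+```python
+from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
+
+tokenizer = AutoTokenizer.from_pretrained("supermy/c2m-mt5")
+model = AutoModelForSeq2SeqLM.from_pretrained("supermy/c2m-mt5")
+
+text = "用你喜欢的任何文本替换我。"  # replace with any Classical Chinese text
+encoded_input = tokenizer(text, return_tensors='pt')
+# A plain forward pass needs decoder inputs, so call generate() instead.
+output_ids = model.generate(**encoded_input, max_length=52)
+output = tokenizer.batch_decode(output_ids, skip_special_tokens=True)
+```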
+## Training data
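+A very comprehensive parallel corpus of Classical Chinese (ancient texts) and Modern Chinese that covers most of the classic ancient works.
+The originally crawled data was aligned at the passage level. After script-based sentence splitting (on full stops, semicolons, exclamation marks, and question marks) and manual proofreading, it yields about 960,000 sentence pairs in total. The `bitext` directory holds the aligned Classical-Modern parallel data. In addition, the `source` directory holds the Classical Chinese monolingual data and the `target` directory holds the Modern Chinese monolingual data; the files in these two directories are aligned line by line.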
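
The sentence-splitting rule described above can be sketched as follows. This is not the corpus authors' script, which is not shown here; `split_sentences` and `SENTENCE_PATTERN` are hypothetical names, and the sketch assumes only the full-width marks 。；！？ end a sentence.

```python
import re
from typing import List

# Sentence-final punctuation named in the description: 。 ； ！ ？
SENTENCE_PATTERN = re.compile(r"[^。；！？]+[。；！？]?")


def split_sentences(passage: str) -> List[str]:
    """Split a passage after each full stop, semicolon, exclamation or question mark."""
    return [s.strip() for s in SENTENCE_PATTERN.findall(passage) if s.strip()]


print(split_sentences("逝者如斯夫！不舍昼夜。"))
# ['逝者如斯夫！', '不舍昼夜。']
```

Splitting the classical and modern sides independently does not guarantee the same number of sentences on both sides, which is presumably why the description mentions manual proofreading.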
+The statistics are listed below. The short texts, which include shorter classics such as 《论语》, 《孟子》, and 《左传》, have been merged with 《资治通鉴》.
+|Book|Sentences
+|:--|:--|
+短篇章和资治通鉴|348727
+元史|21182
+北史|25823
+北齐书|10947
+南史|13838
+南齐书|13137
+史记|17701
+后汉书|17753
+周书|14930
+太平广记|59358
+宋书|23794
+宋史|77853
+徐霞客游记|22750
+新五代史|10147
+新唐书|12359
+旧五代史|11377
+旧唐书|29185
+明史|85179
+晋书|21133
+梁书|14318
+水经注全|11630
+汉书|37622
+辽史|9278
+金史|13758
+陈书|7096
+隋书|8204
+魏书|28178
+**Total**|**967257**
+The per-book breakdown of 《短篇章和资治通鉴》 is as follows (these counts are not fully accurate):
+|Book|Sentences
+|:--|:--|
+资治通鉴|~79500
+左传|~10900
+大学章句集注|86
+反经|4211
+公孙龙子|73
+管子|6266
+鬼谷子|385
+韩非子|4325
+淮南子|2669
+黄帝内经|6162
+黄帝四经|243
+将苑|100
+金刚经|193
+孔子家语|138
+老子|398
+了凡四训|31
+礼记|4917
+列子|1735
+六韬|693
+六祖坛经|949
+论语|988
+吕氏春秋|2473
+孟子|1654
+梦溪笔谈|1280
+墨子|2921
+千字文|82
+清史稿|1604
+三字经|234
+山海经|919
+伤寒论|712
+商君书|916
+尚书|1048
+世说新语|3044
+司马法|132
+搜神记|1963
+搜神后记|540
+素书|61
+孙膑兵法|230
+孙子兵法|338
+天工开物|807
+尉缭子|226
+文昌孝经|194
+文心雕龙|1388
+吴子|136
+孝经|102
+笑林广记|1496
+荀子|3131
+颜氏家训|510
+仪礼|2495
+易传|711
+逸周书|1505
+战国策|3318
+贞观政要|1291
+中庸|206
+周礼|2026
+周易|460
+庄子|1698
+百战奇略|800
+论衡|~11900
+智囊|2165
+罗织经|188
+朱子家训|31
+抱朴子|217
+地藏经|547
+国语|3841
+容斋随笔|2921
+幼学琼林|1372
+三略|268
+围炉夜话|387
+冰鉴|120
+If you use this corpus, please cite the source: https://github.com/NiuTrans/Classical-Modern
+Thanks to the members who contributed to this corpus: 丁佳鹏, 杨文权, 刘晓晴, 曹润柘, and 罗应峰.
+## Training procedure
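+Trained for 4 full days on an NVIDIA GPU with 16 GB of memory, 68 runs in total.
+Training data: the [Classical-Modern dataset](https://huggingface.co/datasets/supermy/Classical-Modern). Base model: [MT5](https://huggingface.co/google/mt5-small).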
+```
+[INFO|trainer.py:1628] 2022-12-15 16:08:36,696 >> ***** Running training *****
+[INFO|trainer.py:1629] 2022-12-15 16:08:36,696 >>   Num examples = 967255
+[INFO|trainer.py:1630] 2022-12-15 16:08:36,697 >>   Num Epochs = 6
+[INFO|trainer.py:1631] 2022-12-15 16:08:36,697 >>   Instantaneous batch size per device = 12
+[INFO|trainer.py:1632] 2022-12-15 16:08:36,697 >>   Total train batch size (w. parallel, distributed & accumulation) = 12
+[INFO|trainer.py:1633] 2022-12-15 16:08:36,697 >>   Gradient Accumulation steps = 1
+[INFO|trainer.py:1634] 2022-12-15 16:08:36,697 >>   Total optimization steps = 483630
+[INFO|trainer.py:1654] 2022-12-15 16:08:36,698 >>   Continuing training from checkpoint, will skip to saved global_step
+[INFO|trainer.py:1655] 2022-12-15 16:08:36,698 >>   Continuing training from epoch 5
+[INFO|trainer.py:1656] 2022-12-15 16:08:36,698 >>   Continuing training from global step 465000
+{'loss': 5.2906, 'learning_rate': 1.8743667679837894e-06, 'epoch': 5.78}
+{'loss': 5.3196, 'learning_rate': 1.8226743584971985e-06, 'epoch': 5.78}
+{'loss': 5.3467, 'learning_rate': 6.513243595310464e-08, 'epoch': 5.99}
+{'loss': 5.3363, 'learning_rate': 1.344002646651366e-08, 'epoch': 6.0}
+{'train_runtime': 6277.5234, 'train_samples_per_second': 924.494, 'train_steps_per_second': 77.042, 'train_loss': 0.2044413571775476, 'epoch': 6.0}
+***** train metrics *****
+  epoch                    =        6.0
+  train_loss               =     0.2044
+  train_runtime            = 1:44:37.52
+  train_samples            =     967255
+  train_samples_per_second =    924.494
+  train_steps_per_second   =     77.042
+12/15/2022 17:53:23 - INFO - __main__ - *** Evaluate ***
+[INFO|trainer.py:2920] 2022-12-15 17:53:23,729 >> ***** Running Evaluation *****
+[INFO|trainer.py:2922] 2022-12-15 17:53:23,729 >>   Num examples = 200
+[INFO|trainer.py:2925] 2022-12-15 17:53:23,729 >>   Batch size = 12
+100%|██████████| 17/17 [00:07<00:00,  2.29it/s]
+[INFO|modelcard.py:443] 2022-12-15 17:53:32,737 >> Dropping the following result as it does not have all the necessary fields:
+{'task': {'name': 'Translation', 'type': 'translation'}, 'metrics': [{'name': 'Bleu', 'type': 'bleu', 'value': 0.7225}]}
+***** eval metrics *****
+  epoch                   =        6.0
+  eval_bleu               =     0.7225
+  eval_gen_len            =     12.285
+  eval_loss               =     6.6782
+  eval_runtime            = 0:00:07.77
+  eval_samples            =        200
+  eval_samples_per_second =     25.721
+  eval_steps_per_second   =      2.186
+```
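
The step counts in the log above can be cross-checked with a little arithmetic. The sketch below uses only numbers copied from the log; the variable names are introduced for illustration. The reading that the summary throughput and `train_loss` are averaged over all planned steps, rather than over the steps executed after resuming, is an inference from the numbers and not something the log states.

```python
import math

num_examples = 967255      # "Num examples"
batch_size = 12            # "Total train batch size"
num_epochs = 6             # "Num Epochs"
resumed_step = 465000      # "Continuing training from global step"
train_runtime = 6277.5234  # "train_runtime", in seconds
reported_loss = 0.2044413571775476  # "train_loss"

steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs
steps_this_run = total_steps - resumed_step

print(steps_per_epoch)                  # 80605
print(total_steps)                      # 483630, matches "Total optimization steps"
print(resumed_step // steps_per_epoch)  # 5, matches "Continuing training from epoch 5"
print(steps_this_run)                   # 18630 steps executed in this session

# Dividing ALL planned steps by this session's runtime gives about 77.04,
# which is close to the reported train_steps_per_second of 77.042.
print(total_steps / train_runtime)

# Rescaling the reported train_loss to the steps actually run gives a value
# near 5.3, in line with the per-step losses printed in the log.
print(reported_loss * total_steps / steps_this_run)
```

Under that reading, the session itself ran at roughly 3 steps per second, and the reported `train_loss` of 0.2044 is not directly comparable with the logged per-step losses of about 5.3.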