hannuta committed on
Commit
4361b48
1 Parent(s): fc337db

Update README.md

Files changed (1)
  1. README.md +102 -0
README.md CHANGED
@@ -1,3 +1,105 @@
---
language: de
license: mit
inference: false
tags:
- gptj
- title generation
- headline generation
- teaser generation
- news
---
# Model Card for Model GPT-J-Title-Teaser-1k

<!-- Provide a quick summary of what the model is/does. -->

gptj-title-teaser-1k
Version 1.0 / 22 December 2022

A proof of concept for multitask fine-tuning of [GPT-J-6B-8bit](https://huggingface.co/hivemind/gpt-j-6B-8bit) to generate titles and teasers for German news articles.

# Model Details

## Model Description

- **Developed by:** snipaid
- **Model type:** gptj
- **Language(s) (NLP):** de
- **License:** MIT
- **Finetuned from model:** [GPT-J-6B-8bit](https://huggingface.co/hivemind/gpt-j-6B-8bit)

# Uses

<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->

This model is not intended for use. It is a preliminary version of gptj-title-teaser-10k, built to validate the multitask fine-tuning approach.
For actual use, please refer to [gptj-title-teaser-10k](https://huggingface.co/snipaid/gptj-title-teaser-10k).

# Training Details

## Training Data

<!-- This should link to a Data Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->

The model was fine-tuned on a collection of 1,000 news items scraped from various German-language online news outlets.

For each news item, the dataset contains the title, the teaser and the full text.

```
[
  {
    "title": ...,
    "teaser": ...,
    "fulltext": ...
  },
]
```

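As an illustration of the record shape above, here is a minimal sketch of loading and checking such records. The `validate_items` helper and the sample records are hypothetical placeholders, not part of the original project or dataset; the only thing taken from the model card is the three field names.

```python
import json

# Field names as listed in the dataset description above.
REQUIRED_FIELDS = ("title", "teaser", "fulltext")


def validate_items(items):
    """Keep only records with a non-empty string for every required field."""
    valid = []
    for item in items:
        if all(
            isinstance(item.get(field), str) and item[field].strip()
            for field in REQUIRED_FIELDS
        ):
            valid.append(item)
    return valid


# Hypothetical placeholder records, not taken from the real dataset.
raw = json.dumps([
    {"title": "Beispieltitel", "teaser": "Beispielteaser.", "fulltext": "Beispieltext."},
    {"title": "", "teaser": "Teaser ohne Titel.", "fulltext": "Text."},
])

items = validate_items(json.loads(raw))
print(len(items))  # the record with the empty title is dropped
```

Dropping incomplete records up front matters here because each record later yields one training input per task, and a missing title or teaser would produce an input with an empty target.
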
## Training Procedure

<!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->

The model was fine-tuned on both tasks at once (multitask fine-tuning) using a causal language modeling (CLM) objective.

### Preprocessing

For each news item, two training inputs were built by concatenating the full text with the title and with the teaser, as shown below.
```
# One training input per task for each news item
title_input = f"[Text]: {item.fulltext} \n [Title]: {item.title}"
teaser_input = f"[Text]: {item.fulltext} \n [Teaser]: {item.teaser}"
```
This results in one input per task for each news item.

*Note: The inserted prompt "[Text]:" marks the beginning of the news item's full text.
In the same manner, "[Title]:" prompts the news item's title and "[Teaser]:" the news item's teaser.*

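The preprocessing described above can be sketched as a small self-contained script. The `NewsItem` class and both helper functions are hypothetical names introduced for illustration. The training template follows the card; the open-ended prompt in `build_prompt` is an assumption about how such a model would be queried at inference time and is not stated in the source.

```python
from dataclasses import dataclass


@dataclass
class NewsItem:
    """Hypothetical container mirroring the dataset fields."""
    title: str
    teaser: str
    fulltext: str


def build_training_inputs(item):
    """Return the title-task and teaser-task training strings for one item."""
    return [
        f"[Text]: {item.fulltext} \n [Title]: {item.title}",
        f"[Text]: {item.fulltext} \n [Teaser]: {item.teaser}",
    ]


def build_prompt(fulltext, task):
    """Build a prompt that stops at the task marker (assumed inference usage)."""
    if task not in ("Title", "Teaser"):
        raise ValueError(f"unknown task: {task}")
    return f"[Text]: {fulltext} \n [{task}]:"


# Placeholder content, not taken from the real dataset.
item = NewsItem(title="Beispieltitel", teaser="Beispielteaser.", fulltext="Beispieltext.")
examples = build_training_inputs(item)
for example in examples:
    print(repr(example))
```

Because the prompt is a strict prefix of the corresponding training string, a model trained with a CLM objective on these inputs would be expected to continue the prompt with a title or teaser, depending on the marker.
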
# Evaluation

1,000 German news articles proved sufficient to validate the approach.
Evaluation showed that the model improved over the GPT-J baseline in:
- German language capabilities (significantly)
- title generation (significantly)
- teaser generation (slightly)

The evaluation also suggested that there is still room for improvement with more data.
For the model trained with the same approach on 10x the amount of data, please refer to [gptj-title-teaser-10k](https://huggingface.co/snipaid/gptj-title-teaser-10k).

# Environmental Impact

<!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->

Carbon emissions were estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).

- **Hardware Type:** A100 SXM4
- **Hours used:** 2h 42min
- **Cloud Provider:** Vast.ai
- **Compute Region:** Unknown
- **Carbon Emitted:** ~0.47 kg CO2eq

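The reported figure can be roughly reproduced with the usual power × time × carbon-intensity estimate. This is a sketch under two assumptions that the card does not state: a power draw of 400 W for the A100 SXM4 and a grid carbon intensity of 0.432 kg CO2eq per kWh. Under these assumptions the result rounds to about 0.47 kg, which is consistent with the value listed above.

```python
def estimate_emissions_kg(power_watts, hours, kg_co2eq_per_kwh):
    """Estimate emissions as energy used (kWh) times grid carbon intensity."""
    energy_kwh = power_watts / 1000 * hours
    return energy_kwh * kg_co2eq_per_kwh


# From the model card: 2h 42min of training.
hours = 2 + 42 / 60

# Assumptions, not stated in the model card.
ASSUMED_POWER_WATTS = 400          # assumed power draw of one A100 SXM4
ASSUMED_KG_CO2EQ_PER_KWH = 0.432   # assumed carbon intensity for an unknown region

emissions = estimate_emissions_kg(ASSUMED_POWER_WATTS, hours, ASSUMED_KG_CO2EQ_PER_KWH)
print(f"{emissions:.2f} kg CO2eq")
```
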
# Glossary

- **News Item**, aka news article or news story. A particular piece of news, usually from a journalistic source.
- **Snippet**. A small section of text that is related to a news item.
- **Title**, aka headline. A few words that reflect the essence of the news story.
- **Teaser**, aka lede. A few sentences that spark curiosity about the "best of the rest" of the news story.