---
language: de
license: mit
inference: false
tags:
- gptj
- title generation
- headline generation
- teaser generation
- news
---
# GPT-J-Title-Teaser-1k

<!-- Provide a quick summary of what the model is/does. -->

gptj-title-teaser-1k  
Version 1.0 / 22 December 2022

A proof of concept for multitask fine-tuning of [GPT-J-6B-8bit](https://huggingface.co/hivemind/gpt-j-6B-8bit) for German news title and teaser generation.

# Model Details

## Model Description

- **Developed by:** snipaid
- **Model type:** gptj
- **Language(s) (NLP):** de
- **License:** MIT
- **Finetuned from model:** [GPT-J-6B-8bit](https://huggingface.co/hivemind/gpt-j-6B-8bit)

# Uses

<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->

This model is not intended for use! It is a preliminary version of gptj-title-teaser-10k, built to validate the multitask fine-tuning approach.  
For actual use, please refer to [gptj-title-teaser-10k](https://huggingface.co/snipaid/gptj-title-teaser-10k).


# Training Details

## Training Data

<!-- This should link to a Data Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->

The model was fine-tuned on a collection of 1,000 news items scraped from several German-language online news outlets.

For each news item, the dataset contains the title, the teaser and the full text.

```
[
  {
    "title": ...,
    "teaser": ...,
    "fulltext": ...
  },
]
```
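
As a minimal sketch of working with records in this shape, the snippet below parses a JSON array and keeps only items that carry all three fields. It assumes the dataset is stored as a JSON array of objects; the names `NewsRecord` and `parse_news_records` are hypothetical and not part of the original project.

```python
import json
from typing import TypedDict


class NewsRecord(TypedDict):
    """One news item with the three fields described above (hypothetical type)."""
    title: str
    teaser: str
    fulltext: str


REQUIRED_KEYS = ("title", "teaser", "fulltext")


def parse_news_records(raw: str) -> list[NewsRecord]:
    """Parse a JSON array and keep records whose three fields are non-empty strings."""
    records: list[NewsRecord] = []
    for entry in json.loads(raw):
        if all(isinstance(entry.get(key), str) and entry[key].strip() for key in REQUIRED_KEYS):
            records.append(
                NewsRecord(
                    title=entry["title"],
                    teaser=entry["teaser"],
                    fulltext=entry["fulltext"],
                )
            )
    return records


# Illustrative data only: the second entry lacks a teaser and full text, so it is dropped.
sample = '[{"title": "T", "teaser": "S", "fulltext": "F"}, {"title": "only a title"}]'
print(len(parse_news_records(sample)))
```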

## Training Procedure

<!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->

The model was fine-tuned on both tasks jointly using a causal language modeling (CLM) objective.

### Preprocessing

For each news item, two training inputs were built by concatenating the full text with the title and with the teaser, as shown below.
```
f"[Text]: {item.fulltext} \n [Title]: {item.title}"
f"[Text]: {item.fulltext} \n [Teaser]: {item.teaser}"
```
This results in one input per task for each news item.

*Note: The inserted prompt "[Text]:" marks the beginning of the news item's fulltext.  
In the same manner "[Title]:" prompts the news item's title and "[Teaser]:" the news item's teaser.*
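
The preprocessing step can be sketched as a small function that returns the two task-specific inputs for one news item. The format strings are taken from the templates above; the `NewsItem` dataclass and the function name `build_training_inputs` are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass
class NewsItem:
    """Container for one news item (hypothetical; mirrors the dataset fields)."""
    title: str
    teaser: str
    fulltext: str


def build_training_inputs(item: NewsItem) -> list[str]:
    """Return one input per task: title generation first, teaser generation second."""
    return [
        f"[Text]: {item.fulltext} \n [Title]: {item.title}",
        f"[Text]: {item.fulltext} \n [Teaser]: {item.teaser}",
    ]


# Illustrative data only.
example = NewsItem(title="T", teaser="S", fulltext="F")
for text in build_training_inputs(example):
    print(repr(text))
```

Applied to the full dataset, this yields two inputs per news item, so 1,000 items produce 2,000 training inputs.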

# Evaluation

1,000 German news articles proved to be sufficient to validate the approach.
Evaluation showed that the model improved over the GPT-J baseline in:
- German language capabilities (significantly)
- title generation (significantly)
- teaser generation (slightly)

The evaluation also suggested that more training data would improve the model further.  
For the model trained with the same approach on 10x the amount of data, please refer to [gptj-title-teaser-10k](https://huggingface.co/snipaid/gptj-title-teaser-10k).

# Environmental Impact

<!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->

Carbon emissions were estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).

- **Hardware Type:** A100 SXM4
- **Hours used:** 2h 42min
- **Cloud Provider:** Vast.ai
- **Compute Region:** Unknown
- **Carbon Emitted:** ~0.47 kg CO2e
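
For orientation, an estimate of this kind is typically the product of power draw, runtime and grid carbon intensity. The sketch below reproduces that arithmetic. The card states only the hardware type and runtime; the 400 W power draw and the 0.432 kg CO2e/kWh carbon intensity are assumptions chosen for illustration, not values reported by the authors.

```python
# Assumption: power draw of one A100 SXM4 GPU, in kW (not stated in this card).
GPU_POWER_KW = 0.400
# Assumption: grid carbon intensity in kg CO2e per kWh (not stated in this card).
CARBON_INTENSITY_KG_PER_KWH = 0.432

# Runtime from the card: 2h 42min.
hours_used = 2 + 42 / 60

energy_kwh = GPU_POWER_KW * hours_used
emissions_kg = energy_kwh * CARBON_INTENSITY_KG_PER_KWH

print(f"Energy: {energy_kwh:.2f} kWh, emissions: {emissions_kg:.2f} kg CO2e")
```

Under these assumptions the result is close to the reported figure; with a different compute region, the carbon intensity and therefore the estimate would change.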

# Glossary

**News Item**, aka news article. A particular piece of news, usually from a journalistic source.  
**Snippet**, a small section of text that is related to a news item.  
**Title**, aka headline. A few words that reflect the essence of the news story.  
**Teaser**, aka lede. A few sentences that spark curiosity about the "best of the rest" of the news story.