---
license: mit
tags:
- generated_from_keras_callback
base_model: facebook/bart-large-cnn
model-index:
- name: bart-large-finetuned-filtered-spotify-podcast-summ
  results: []
---

# bart-large-finetuned-filtered-spotify-podcast-summ

This model is a fine-tuned version of [facebook/bart-large-cnn](https://huggingface.co/facebook/bart-large-cnn) on the [Spotify Podcast Dataset](https://arxiv.org/abs/2004.04270). See the project's [GitHub repository](https://github.com/TheOnesThatWereAbroad/PodcastSummarization) for details.

It achieves the following results during training:
- Train Loss: 2.2967
- Validation Loss: 2.8316
- Epoch: 2

## Intended uses & limitations

This model is intended for automatic podcast summarization. Given a podcast transcript as input, the objective is to produce a short text summary that a user might read when deciding whether to listen to the podcast. The summary should accurately convey the content of the podcast, be human-readable, and be short enough to read quickly on a smartphone screen.

## Training and evaluation data
In our solution, an extractive module selects salient chunks from the transcript, which serve as the input to an abstractive summarizer.
The creator-provided descriptions undergo extensive pre-processing to select a subset of the corpus that is suitable for training the supervised model.
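
The extract-then-abstract idea can be illustrated with a small sketch. The project's actual extractive module is not described in this card, so the scoring below (average word frequency per chunk) is only a generic stand-in, and the names `split_into_chunks` and `select_salient_chunks` are hypothetical.

```python
import re
from collections import Counter


def split_into_chunks(transcript: str, sentences_per_chunk: int = 3) -> list[str]:
    """Group consecutive sentences into fixed-size chunks."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", transcript) if s.strip()]
    return [
        " ".join(sentences[i:i + sentences_per_chunk])
        for i in range(0, len(sentences), sentences_per_chunk)
    ]


def select_salient_chunks(transcript: str, top_k: int = 2, sentences_per_chunk: int = 3) -> str:
    """Keep the top_k chunks by average word frequency, in their original order.

    Assumption: word frequency is a crude proxy for salience; it is not the
    scoring used by the authors.
    """
    chunks = split_into_chunks(transcript, sentences_per_chunk)
    word_counts = Counter(re.findall(r"[a-z']+", transcript.lower()))

    def score(chunk: str) -> float:
        words = re.findall(r"[a-z']+", chunk.lower())
        return sum(word_counts[w] for w in words) / max(len(words), 1)

    ranked = sorted(range(len(chunks)), key=lambda i: score(chunks[i]), reverse=True)
    keep = sorted(ranked[:top_k])  # restore transcript order
    return " ".join(chunks[i] for i in keep)
```

The string returned by `select_salient_chunks` would then be passed to the abstractive summarizer in place of the full transcript.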

We split the filtered dataset into train/dev sets of 69,336/7,705 episodes.
The test set consists of 1,027 episodes, of which only 1,025 were used because two did not contain an episode description.
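
As a rough illustration of this step, the sketch below drops episodes that lack a description and then makes a roughly 90/10 train/dev split. It is an assumption-laden simplification: the real filtering is more extensive, and the field names (`transcript`, `description`), the `dev_fraction` default, and the seed are hypothetical.

```python
import random


def filter_and_split(episodes, dev_fraction=0.1, seed=42):
    """Drop episodes without a description, then split into (train, dev) lists."""
    # Keep only episodes whose description is present and non-blank.
    usable = [ep for ep in episodes if (ep.get("description") or "").strip()]
    # Shuffle with a fixed seed so the split is reproducible.
    rng = random.Random(seed)
    rng.shuffle(usable)
    n_dev = int(len(usable) * dev_fraction)
    return usable[n_dev:], usable[:n_dev]
```

Each episode is assumed to be a dictionary; episodes whose description is missing, `None`, or whitespace-only are excluded before the split.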


## How to use

The model can be used for summarization as follows:

```python
from transformers import pipeline

# Placeholder input: replace with the transcript text of the episode to summarize.
podcast_transcript = "..."

summarizer = pipeline("summarization", model="gmurro/bart-large-finetuned-filtered-spotify-podcast-summ", tokenizer="gmurro/bart-large-finetuned-filtered-spotify-podcast-summ")
summary = summarizer(podcast_transcript, min_length=39, max_length=250)
print(summary[0]['summary_text'])
```

### Training hyperparameters

The following hyperparameters were used during training:
- optimizer: `{'name': 'AdamWeightDecay', 'learning_rate': 2e-05, 'decay': 0.0, 'beta_1': 0.9, 'beta_2': 0.999, 'epsilon': 1e-07, 'amsgrad': False, 'weight_decay_rate': 0.01}`
- training_precision: float32
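
To make the optimizer settings concrete, here is a minimal scalar sketch of one update of Adam with decoupled weight decay, using the values listed above as defaults. It follows the textbook bias-corrected form; the library's exact implementation may differ in details such as where `epsilon` is applied, and the function name `adam_weight_decay_step` is hypothetical.

```python
import math


def adam_weight_decay_step(param, grad, m, v, step,
                           learning_rate=2e-05, beta_1=0.9, beta_2=0.999,
                           epsilon=1e-07, weight_decay_rate=0.01):
    """One scalar update of Adam with decoupled weight decay.

    Returns the new (param, m, v). `step` counts from 1.
    """
    # Decoupled weight decay: shrink the parameter directly, outside the Adam moments.
    param = param - learning_rate * weight_decay_rate * param
    # Exponential moving averages of the gradient and its square.
    m = beta_1 * m + (1.0 - beta_1) * grad
    v = beta_2 * v + (1.0 - beta_2) * grad * grad
    # Bias correction for the zero-initialised moments.
    m_hat = m / (1.0 - beta_1 ** step)
    v_hat = v / (1.0 - beta_2 ** step)
    # Adam step scaled by the running gradient magnitude.
    param = param - learning_rate * m_hat / (math.sqrt(v_hat) + epsilon)
    return param, m, v
```

On the first step the Adam term moves the parameter by about `learning_rate` in the direction opposite to the gradient, while the decay term shrinks it by `learning_rate * weight_decay_rate` times its current value.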

### Training results

| Train Loss | Validation Loss | Epoch |
|:----------:|:---------------:|:-----:|
| 3.0440     | 2.8733          | 0     |
| 2.6085     | 2.8549          | 1     |
| 2.2967     | 2.8316          | 2     |


### Framework versions

- Transformers 4.19.4
- TensorFlow 2.9.1
- Datasets 2.3.1
- Tokenizers 0.12.1


## Authors

|   Name    |  Surname  |                 Email                  |                       Username                        |
| :-------: | :-------: | :------------------------------------: | :---------------------------------------------------: |
| Giuseppe  |   Boezio  | `[email protected]`      | [_giuseppeboezio_](https://github.com/giuseppeboezio) |
| Simone    |  Montali  |    `[email protected]`    |         [_montali_](https://github.com/montali)         |
| Giuseppe  |    Murro  |    `[email protected]`    |         [_gmurro_](https://github.com/gmurro)         |