---
datasets:
- roemmele/ablit
language:
- en
pipeline_tag: summarization
---
# Model Card for ablit-bart-base

This model is initialized from facebook/bart-base. It has been fine-tuned on the AbLit dataset, which consists of abridged versions of books aligned with their original versions at the passage level. Given a text, the model generates an abridgement of the text based on what it has observed in AbLit. See the cited paper for more details.

## Model Details

### Model Description

- **Developed by:** Language Weaver (Melissa Roemmele, Kyle Shaffer, Katrina Olsen, Yiyi Wang, and Steve DeNeefe)
- **Model type:** Seq2SeqLM
- **Language(s) (NLP):** English
- **License:** [More Information Needed]
- **Finetuned from model:** facebook/bart-base

### Model Sources

- **Repository:** [github.com/roemmele/AbLit](https://github.com/roemmele/AbLit)
- **Paper:** [AbLit: A Resource for Analyzing and Generating Abridged Versions of English Literature](https://arxiv.org/pdf/2302.06579.pdf)

## Uses

This model generates abridged versions of texts informed by the AbLit dataset.

## Bias, Risks, and Limitations

This model comes from research on abridgement as an NLP task, but the dataset the model is trained on (AbLit) is derived from a small set of texts associated with a specific domain and author. There are significant practical reasons for this limited scope. In particular, in contrast to the books in AbLit, most recently published books are not included in publicly accessible datasets due to copyright restrictions, and the same restrictions typically apply to any abridgements of these books. For this reason, AbLit consists of British English literature from the 18th and 19th centuries. Some of the linguistic properties of these original books do not generalize to other types of English texts, and therefore the model might not produce desirable abridgements for such texts.

## How to Get Started with the Model

```
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("roemmele/ablit-bart-base")
model = AutoModelForSeq2SeqLM.from_pretrained("roemmele/ablit-bart-base")

passage = "The letter was not unproductive. It re-established peace and kindness."

# Tokenize the passage, padding to the model's maximum input length
input_ids = tokenizer(
    passage,
    padding="max_length",
    return_tensors="pt").input_ids

# Generate an abridgement with beam search
output_ids = model.generate(
    input_ids,
    max_length=1024,
    num_beams=5,
    no_repeat_ngram_size=3)[0]

abridgement = tokenizer.decode(
    output_ids,
    skip_special_tokens=True)

print(abridgement)
# The letter re-established peace and kindness.
```
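
Because the card tags this model with the summarization pipeline task, it can also be loaded through the high-level `pipeline` API. A minimal sketch; the generation settings mirror the example above but are not prescribed by the model card:

```
from transformers import pipeline

# Load the model via the summarization pipeline (per the card's pipeline_tag)
abridger = pipeline("summarization", model="roemmele/ablit-bart-base")

passage = "The letter was not unproductive. It re-established peace and kindness."

# Generation kwargs are passed through to model.generate()
result = abridger(passage, max_length=1024, num_beams=5, no_repeat_ngram_size=3)
print(result[0]["summary_text"])
```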

## Training Details

### Training Data

[roemmele/AbLit](https://huggingface.co/datasets/roemmele/ablit), specifically the train split of the "chunks-10-sentences" subset, i.e.:

```
from datasets import load_dataset

# Load AbLit with original/abridged passages chunked into 10-sentence units
data = load_dataset("roemmele/ablit", "chunks-10-sentences")
train_data = data["train"]
```

### Training Procedure

We used the training script [here](https://github.com/huggingface/transformers/blob/main/examples/pytorch/summarization/run_summarization.py).

Hyperparameter settings: We specified a maximum length of 1024 for both the source (original passage) and target (abridged passage), truncating all tokens beyond this limit. We evaluated each model on the AbLit development set after each epoch and concluded training when cross-entropy loss stopped decreasing. We used a batch size of 4. For all other hyperparameters we used the default values set by the script.
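
Under those settings, the invocation would look roughly like the sketch below. The dataset column names (`original`, `abridged`) and the output directory are assumptions, and the stop-when-loss-stops-decreasing rule was applied by monitoring the per-epoch evaluation rather than through a script flag:

```
# Sketch only: --text_column/--summary_column values and --output_dir are assumptions
python run_summarization.py \
    --model_name_or_path facebook/bart-base \
    --dataset_name roemmele/ablit \
    --dataset_config_name chunks-10-sentences \
    --text_column original \
    --summary_column abridged \
    --max_source_length 1024 \
    --max_target_length 1024 \
    --per_device_train_batch_size 4 \
    --do_train \
    --do_eval \
    --evaluation_strategy epoch \
    --output_dir ./ablit-bart-base
```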

#### Training Hyperparameters

- **Training regime:** [More Information Needed]

#### Speeds, Sizes, Times

It took ≈3 hours to train each model on a g4dn.4xlarge AWS instance.

## Evaluation

### Testing Data

Test split of the "chunks-10-sentences" subset of [roemmele/AbLit](https://huggingface.co/datasets/roemmele/ablit).

### Results

The model obtained a ROUGE-L score of 0.78 on the AbLit test set. See the paper for the results of other metrics.
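
For reference, ROUGE-L can be computed with the Hugging Face `evaluate` library; a minimal sketch, not necessarily the exact scoring configuration used in the paper:

```
import evaluate

# Load the ROUGE metric from the evaluate library
rouge = evaluate.load("rouge")

predictions = ["The letter re-established peace and kindness."]  # model outputs (placeholder)
references = ["The letter re-established peace and kindness."]   # human abridgements (placeholder)

scores = rouge.compute(predictions=predictions, references=references)
print(scores["rougeL"])
```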

### Conclusion

Our analysis shows that, in comparison with human-authored abridgements, the model-generated abridgements tend to preserve more of the original text. This suggests it is challenging to learn what text can be removed while maintaining loyalty to the important parts of the original.

## Citation

**BibTeX:**

```
@inproceedings{roemmele2023ablit,
  title={AbLit: A Resource for Analyzing and Generating Abridged Versions of English Literature},
  author={Roemmele, Melissa and Shaffer, Kyle and Olsen, Katrina and Wang, Yiyi and DeNeefe, Steve},
  booktitle={Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume},
  publisher={Association for Computational Linguistics},
  year={2023}
}
```

**APA:**

Roemmele, M., Shaffer, K., Olsen, K., Wang, Y., & DeNeefe, S. (2023). AbLit: A Resource for Analyzing and Generating Abridged Versions of English Literature. 17th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2023).

## Model Card Authors

Melissa Roemmele

## Model Card Contact

Melissa Roemmele