LED_para document simplification model

This is a pretrained version of the document simplification model presented in the Findings of ACL 2023 paper "Context-Aware Document Simplification".

It is an end-to-end system based on the Longformer encoder-decoder (LED) that operates at the paragraph level.

The target reading level (1-4) should be indicated via a control token prepended to each input sequence ("<RL_1>", "<RL_2>", "<RL_3>", "<RL_4>"). The terminal interface adds this token automatically.
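
When calling the model directly, the control token has to be added by hand. A minimal sketch of one way to do that is shown here; the helper name add_reading_level is hypothetical and not part of plan_simp, and the single space after the token simply mirrors the example input in this README.

```python
VALID_LEVELS = (1, 2, 3, 4)

def add_reading_level(text: str, level: int) -> str:
    """Prepend the <RL_n> control token for the target reading level."""
    if level not in VALID_LEVELS:
        raise ValueError(f"reading level must be one of {VALID_LEVELS}, got {level}")
    return f"<RL_{level}> {text}"

print(add_reading_level("Turing has an extensive legacy.", 3))
```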

How to use

It is recommended to use the plan_simp library to interface with the model.

Here is how to use this model in PyTorch:

from plan_simp.models.bart import load_simplifier

simplifier, tokenizer, hparams = load_simplifier("liamcripwell/ledpara")

text = "<RL_3> Turing has an extensive legacy with statues of him and many things named after him, including an annual award for computer science innovations. He appears on the current Bank of England £50 note, which was released on 23 June 2021, to coincide with his birthday. A 2019 BBC series, as voted by the audience, named him the greatest person of the 20th century."
inputs = tokenizer(text, return_tensors="pt")
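outputs = simplifier.generate(**inputs, num_beams=5)

# Decode the generated token IDs back into text.
simplified = tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]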

Generation and evaluation can also be run from the terminal.

python plan_simp/scripts/generate.py inference \
    --model_ckpt=liamcripwell/ledpara \
    --test_file=<test_data> \
    --reading_lvl=s_level \
    --out_file=<output_csv>
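
The test file is a CSV, and --reading_lvl names the column that holds the target reading level. The sketch below builds a small file of that shape. The column names are assumptions inferred from the commands in this README ("complex_str" and "pair_id" from the evaluation command, "s_level" from --reading_lvl), and the file name and helper write_test_csv are hypothetical; check the plan_simp documentation for the exact schema the script expects.

```python
import csv

# Assumed column names, inferred from the commands in this README.
FIELDS = ["pair_id", "complex_str", "s_level"]

def write_test_csv(path, docs):
    """Write (pair_id, text, reading_level) triples to a CSV file."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        writer.writeheader()
        for pair_id, text, level in docs:
            writer.writerow(
                {"pair_id": pair_id, "complex_str": text, "s_level": level}
            )

docs = [
    ("doc-0", "Turing has an extensive legacy with statues of him.", 3),
    ("doc-1", "He appears on the current Bank of England £50 note.", 2),
]
write_test_csv("my_test_data.csv", docs)  # hypothetical file name

# Read the file back to confirm its contents.
with open("my_test_data.csv", newline="", encoding="utf-8") as f:
    rows = list(csv.DictReader(f))
```

The resulting path would then be passed as --test_file. No control token is added to the text here, since the terminal interface handles that.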

python plan_simp/scripts/eval_simp.py \
    --input_data=newselaauto_docs_test.csv \
    --output_data=test_out_ledpara.csv \
    --x_col=complex_str \
    --r_col=simple_str \
    --y_col=pred \
    --doc_id_col=pair_id \
    --prepro=True \
    --sent_level=True