---
license: mit
language:
- en
- ru
library_name: transformers
---

---

# IsaNLP RST Parser v3

This repository hosts several versions of the IsaNLP RST Parser. For more details, visit the [GitHub repository](https://github.com/tchewik/isanlp_rst). 

## Performance

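The table at the end of this section reports end-to-end scores for each model version on the corpora listed below. `Seg` is the segmentation score; `S`, `N`, `R`, and `Full` are the span, nuclearity, relation, and full-tree parsing scores.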

### Corpora
- **English:** GUM<sub>9.1</sub>, RST-DT
- **Russian:** RRT<sub>2.1</sub>, RRG<sub>GUM-9.1</sub>

| Tag          | Language   | Train Data  | Test Data   | Seg  | S    | N    | R    | Full  |
|--------------|------------|-------------|-------------|------|------|------|------|-------|
| `gumrrg`     | En, Ru     | GUM, RRG    | GUM         | 95.5 | 67.4 | 56.2 | 49.6 | 48.7  |
|              |            |             | RRG         | 97.0 | 67.1 | 54.6 | 46.5 | 45.4  |
| `rstdt`      | En         | RST-DT      | RST-DT      | 97.8 | 75.6 | 65.0 | 55.6 | 53.9  |
| `rstreebank` | Ru         | RRT         | RRT         | 92.1 | 66.2 | 53.1 | 46.1 | 46.2  |
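
The tag-to-language mapping in the table can be captured in a small lookup when choosing which version to load. This is only a minimal sketch: `VERSIONS` and `versions_for` are hypothetical names for illustration and are not part of `isanlp_rst`; the tags and languages are copied from the table above.

```python
from typing import Dict, List, Set

# Tags and their training languages, copied from the performance table.
# This dictionary is illustrative; it is not provided by isanlp_rst.
VERSIONS: Dict[str, Set[str]] = {
    "gumrrg": {"en", "ru"},
    "rstdt": {"en"},
    "rstreebank": {"ru"},
}


def versions_for(language: str) -> List[str]:
    """Return the tags whose training data covers the given language code."""
    lang = language.strip().lower()
    return [tag for tag, langs in VERSIONS.items() if lang in langs]


print(versions_for("en"))  # ['gumrrg', 'rstdt']
print(versions_for("ru"))  # ['gumrrg', 'rstreebank']
```

The returned tag would then be passed as `hf_model_version`, as in the usage example in the next section.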

## Usage

To use the IsaNLP RST Parser with Hugging Face, follow these steps:

1. **Install the necessary Python package:**

   You will need the `isanlp_rst` library, which is available via pip:

   ```bash
   pip install isanlp_rst
   ```

2. **Example code for parsing RST:**

   The following Python code demonstrates how to run a specific version of the parser using the Hugging Face model:

   ```python
   from isanlp_rst.parser import Parser

   # Define the version of the model you want to use
   version = 'gumrrg'  # from {'gumrrg', 'rstdt', 'rstreebank'}
   
   # Initialize the parser with the desired version
   parser = Parser(hf_model_name='tchewik/isanlp_rst_v3', hf_model_version=version, cuda_device=0)

   # Example text for parsing
   text = """
   On Saturday, in the ninth edition of the T20 Men's Cricket World Cup, Team India won against South Africa by seven runs. 
   The final match was played at the Kensington Oval Stadium in Barbados. This marks India's second win in the T20 World Cup, 
   which was co-hosted by the West Indies and the USA between June 2 and June 29.

   After winning the toss, India decided to bat first and scored 176 runs for the loss of seven wickets. 
   Virat Kohli top-scored with 76 runs, followed by Axar Patel with 47 runs. Hardik Pandya took three wickets, 
   and Jasprit Bumrah took two wickets.
   """

   # Parse the text; res['rst'] holds the predicted binary discourse tree(s)
   res = parser(text)

   # Display the attributes of the root node of the first tree
   vars(res['rst'][0])
   ```

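3. **Understanding the Output:**

   The last call in the example returns the attributes of the root discourse unit as a dictionary, for example: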

   ```python
   {
     'id': 7,
     'left': <isanlp.annotation_rst.DiscourseUnit at 0x7f771076add0>,
     'right': <isanlp.annotation_rst.DiscourseUnit at 0x7f7750b93d30>,
     'relation': 'elaboration',
     'nuclearity': 'NS',
     'start': 0,
     'end': 336,
     'text': "On Saturday, ... took two wickets .",
   }
   ```

   - **id**: A unique identifier for the discourse unit.
   - **left** and **right**: The left and right children of the current discourse unit.
   - **relation**: The rhetorical relation between the two sub-units. In this example, the relation is "elaboration," indicating that one part provides additional detail about the other.
   - **nuclearity**: The nuclearity of the relation. "NS" means that the left unit is the nucleus (N) and the right unit is the satellite (S); "SN" reverses the roles, and "NN" marks a multinuclear relation.
   - **start** and **end**: The character offsets in the text for this discourse unit.
   - **text**: The text span corresponding to this discourse unit.

4. **(Optional) Save the result in RS3 format:**

   To save the resulting RST tree as an `.rs3` file, call `to_rs3` on the tree:

   ```python
   # Export the RST tree to an RS3 file
   res['rst'][0].to_rs3('filename.rs3')
   ```

   The resulting `filename.rs3` file can then be opened in RSTTool or rstWeb for visualization or editing:

   <img src="https://huggingface.co/tchewik/isanlp_rst_v3/resolve/main/example-image.png" alt="RSTTool Example" width="800">
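
The fields described in step 3 (`left`, `right`, `relation`, `nuclearity`, `start`, `end`, `text`) are enough to walk a tree recursively. The sketch below is self-contained and runs without the parser: `Node` is a hypothetical stand-in with the same field names, and `walk` and `render` are illustrative helpers, not part of `isanlp` or `isanlp_rst`. It assumes that leaf units have no children (`left` and `right` missing or `None`), which should be checked against real `DiscourseUnit` objects.

```python
from dataclasses import dataclass
from typing import Iterator, Optional, Tuple


@dataclass
class Node:
    """Hypothetical stand-in exposing the fields listed in step 3."""
    id: int
    text: str
    start: int
    end: int
    relation: str = ""      # left empty for leaves in this toy example
    nuclearity: str = ""
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def walk(node, depth: int = 0) -> Iterator[Tuple[int, object]]:
    """Yield (depth, node) pairs in pre-order, left child before right."""
    if node is None:
        return
    yield depth, node
    # getattr keeps the walk safe if a leaf lacks the attribute entirely.
    yield from walk(getattr(node, "left", None), depth + 1)
    yield from walk(getattr(node, "right", None), depth + 1)


def render(root) -> str:
    """Return an indented outline: relations for inner nodes, text for leaves."""
    lines = []
    for depth, node in walk(root):
        is_leaf = (getattr(node, "left", None) is None
                   and getattr(node, "right", None) is None)
        if is_leaf:
            label = f"EDU [{node.start}:{node.end}] {node.text!r}"
        else:
            label = f"{node.relation} ({node.nuclearity})"
        lines.append("  " * depth + label)
    return "\n".join(lines)


# Toy two-unit tree built by hand; offsets are computed from the strings.
first = "India won the final."
second = "Kohli top-scored."
left = Node(id=1, text=first, start=0, end=len(first))
right = Node(id=2, text=second, start=len(first) + 1,
             end=len(first) + 1 + len(second))
root = Node(id=3, text=first + " " + second, start=0, end=right.end,
            relation="elaboration", nuclearity="NS", left=left, right=right)

print(render(root))
```

If the assumption about leaves holds, passing a parsed tree (for example `res['rst'][0]`) to `render` in place of the toy `root` should give the same kind of outline.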

## Citation

If you use the IsaNLP RST Parser in your research, please cite our work as follows:

- **For versions `gumrrg`, `rstdt`, and `rstreebank`:** 
  ```bibtex
  @inproceedings{chistova-2024-bilingual,
    title = "Bilingual Rhetorical Structure Parsing with Large Parallel Annotations",
    author = "Chistova, Elena",
    booktitle = "Findings of the Association for Computational Linguistics ACL 2024",
    month = aug,
    year = "2024",
    address = "Bangkok, Thailand and virtual meeting",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.findings-acl.577",
    pages = "9689--9706"
  }
  ```